* Running an active/active firewall/router (xt_cluster?)
@ 2021-05-09 17:52 Oliver Freyermuth
  2021-05-10 16:57 ` Paul Robert Marino
  2021-05-10 22:19 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-09 17:52 UTC (permalink / raw)
  To: netfilter

[-- Attachment #1: Type: text/plain, Size: 3326 bytes --]

Dear netfilter experts,

we are trying to set up an active/active firewall, making use of "xt_cluster".
We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.

So we set rules like:

  iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
  iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
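
On the second node the rules would be the mirror image, with only the local node id changed (sketch, same placeholder interface name):

  iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 2 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
  iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP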

Ideally, we'd love to be able to scale this to more than two nodes, but let's stay with two for now.

Basic tests show that this works as expected, but the details get messy.

1. Certainly, conntrackd is needed to synchronize connection states.
    But is it always "fast enough"?
    xt_cluster seems to match on the src_ip of the original direction of the flow[0] (if I read the code correctly),
    but what happens if the reply to an outgoing packet arrives at both firewalls before the state is synchronized?
    We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additionally set "PollSecs 15", since without that it seems
    only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
    Maybe another hashing scheme, e.g. XOR(src,dst), might work around the tight synchronization requirements, or is it possible to always use the "internal" source IP?
    Is anybody doing that with a custom BPF?

2. How to do failover in such cases?
    For failover we'd need to change these rules (if one node fails, the total number of nodes changes).
    As an alternative, I found [1], which states that multiple rules can be used and enabled / disabled (see the sketch after this list),
    but does somebody know of a cleaner (and easier to read) way that also does not cost extra performance?

3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
    so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).

4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
    are designed to migrate services rather than to communicate node IDs and the total number of active nodes.
    They can probably be tricked into doing that somehow, but they are not designed this way.
    TIPC may be something to use here, but I found nothing "ready to use".
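
Regarding (2), the enable/disable approach from [1] would presumably boil down to something like this on node 1 when node 2 fails (untested sketch, to be run by whatever failover mechanism detects the peer failure):

  # peer died: additionally accept the flows that hash to node 2
  iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 2 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
  # peer is back: remove the extra rule again
  iptables -D PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 2 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff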

You may also tell me there's a better way to do this than using xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
but maybe someone on this list has already done active/active without commercial hardware and can share their experience?

Cheers and thanks in advance,
	Oliver

PS: Please keep me in CC, I'm not subscribed to the list. Thanks!

[0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
[1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/

-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]


* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-09 17:52 Running an active/active firewall/router (xt_cluster?) Oliver Freyermuth
@ 2021-05-10 16:57 ` Paul Robert Marino
  2021-05-10 21:55   ` Oliver Freyermuth
  2021-05-10 22:19 ` Pablo Neira Ayuso
  1 sibling, 1 reply; 11+ messages in thread
From: Paul Robert Marino @ 2021-05-10 16:57 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: netfilter

hey Oliver,
I've done similar things over the years, a lot of fun lab experiments,
and found it really comes down to a couple of things.
I did some POC testing with conntrackd and some experimental code
around trunking across multiple firewalls with a sprinkling of
virtualization.
There were a few scenarios I tried, some involving Open vSwitch
(because I was experimenting with SPBM) and some not, with conntrackd
similarly configured.
All the scenarios were interesting, but they all had network issues caused
by latency in conntrackd syncing, relatively rare on slow (<1 Gb/s) networks
but growing exponentially in frequency on higher-speed networks (>1 Gb/s).

What I found is that the best scenario was to use Quagga for dynamic
routing to load-balance the traffic between the firewall IPs,
keepalived to handle IP failover, and conntrackd (in a similar
configuration to the one you described) to keep the states in sync.
There are a few pitfalls in going down this route, caused by bad and/or
outdated documentation for both Quagga and keepalived. I'm also going
to give you some recommendations about some hardware topology stuff
you may not think about initially.

I will start with Quagga because the bad documentation part is easy to cover.
In the Quagga documentation they recommend that you put a routable IP
on a loopback interface and attach the Quagga daemon for the dynamic
routing service of your choice to it. That works fine on BSD and old
versions of Linux from 20 years ago, but anything running a Linux
kernel version of 2.4 or higher will not allow it unless you change
settings in /etc/sysctl.conf, and the Quagga documentation tells you to
make those changes. DO NOT DO WHAT THEY SAY, it's wrong and dangerous.
Instead, create a "dummy" interface with a routable IP for this
purpose. A dummy interface is a special kind of interface meant for
exactly the scenario described and works well without compromising the
security of your firewall.
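
For example, something along these lines (sketch; the interface name and address are just placeholders):

  # create a persistent dummy interface and put the routable IP on it
  ip link add dummy0 type dummy
  ip addr add 192.0.2.1/32 dev dummy0
  ip link set dummy0 up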

Keepalived
The main error in keepalived's documentation is that most of the
documentation and howtos you will find about it on the web are based
on a 15-year-old howto which made a fundamental mistake about how VRRP
works and what the "state" flag actually does, because it's not
explained well in the man page. "state" in a "vrrp_instance" should
always be set to "MASTER" on all nodes, and the priority should be used
to determine which node should be the preferred master. The only time
you should ever set state to "BACKUP" is if you have a 3rd machine
that you never want to become the master and which you are just using for
quorum, and in that case its priority should also be set to "0"
(failed). Setting the state to "BACKUP" will seem to work fine until
you have a failover event, when the interface will continually go up
and down on the backup node. On the MAC address issue: keepalived will
ARP-ping the subnets it is attached to, so that's generally not an issue,
but I would recommend using VMACs (virtual MAC addresses), assuming
the kernel for your distro and your network cards support them, because
that way it just looks to the switch like the address moved to a different port due to
some physical topology change, and switches usually handle that very
gracefully, but don't always handle a MAC address change for an IP
address as quickly.
I also recommend reading the RFCs on VRRP, particularly the parts that
explain how the elections and priorities work; they are a quick and
easy read and will really give you a good idea of how to configure
keepalived properly to achieve the failover and recovery behavior you
want.
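
In config terms that is roughly the following on both nodes, with only the priority differing (sketch from memory, names and addresses made up):

  vrrp_instance EXT {
      state MASTER            # MASTER on *both* nodes
      interface eth0
      virtual_router_id 51
      priority 150            # e.g. 100 on the other node
      advert_int 1
      use_vmac                # virtual MAC, if kernel/NIC support it
      virtual_ipaddress {
          192.0.2.254/24
      }
  }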

On the hardware topology:
I recommend using dedicated interfaces for conntrackd; really, you don't
need anything faster than 100 Mb/s even if the data interfaces are
100 Gb/s, but I usually use 1 Gb/s interfaces for this. They can be on
their own dedicated switches or crossover interfaces. The main concern
here is securely handling a large number of tiny packets, so having
dedicated network card buffers to handle microbursts is useful, and if
you can avoid latency from a switch that's trying to be too smart for
its own good, that's for the best.
For keepalived, use dedicated VLANs on each physical interface to
handle the heartbeats and group the VRRP interfaces, to ensure the
failovers of the IPs on both sides are handled correctly.
If you only have 2 firewalls, I recommend using an additional device
on each side for quorum in backup/failed mode as described above.
Assuming a 1-second or greater interval, the device could be something
as simple as a Raspberry Pi; it really doesn't need to be anything
powerful because it's just adding a heartbeat to the cluster, but for
sub-second intervals you may need something more powerful, because they
can eat a surprising amount of CPU.


On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
<freyermuth@physik.uni-bonn.de> wrote:
>
> Dear netfilter experts,
>
> we are trying to setup an active/active firewall, making use of "xt_cluster".
> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.
>
> So we set rules like:
>
>   iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>   iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
>
> Ideally, it we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
>
> Basic tests show that this works as expected, but the details get messy.
>
> 1. Certainly, conntrackd is needed to synchronize connection states.
>     But is it always "fast enough"?
>     xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
>     but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
>     We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additonally set "PollSecs 15" since without that it seems
>     only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
>     Maybe another way which e.g. may use XOR(src,dst) might work around tight synchronization requirements, or is it possible to always uses the "internal" source IP?
>     Is anybody doing that with a custom BPF?
>
> 2. How to do failover in such cases?
>     For failover we'd need to change these rules (if one node fails, the total-nodes will change).
>     As an alternative, I found [1] which states multiple rules can be used and enabled / disabled,
>     but does somebody know of a cleaner (and easier to read) way, also not costing extra performance?
>
> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
>     so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
>
> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
>     are rather used to migrate services, but not to communicate node ids and number of total active nodes.
>     They can probably be tricked into doing that somehow, but they are not designed this way.
>     TIPC may be something to use here, but I found nothing "ready to use".
>
> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
>
> Cheers and thanks in advance,
>         Oliver
>
> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
>
> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/
>
> --
> Oliver Freyermuth
> Universität Bonn
> Physikalisches Institut, Raum 1.047
> Nußallee 12
> 53115 Bonn
> --
> Tel.: +49 228 73 2367
> Fax:  +49 228 73 7869
> --
>


* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-10 16:57 ` Paul Robert Marino
@ 2021-05-10 21:55   ` Oliver Freyermuth
  2021-05-10 22:55     ` Paul Robert Marino
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-10 21:55 UTC (permalink / raw)
  To: Paul Robert Marino; +Cc: netfilter

[-- Attachment #1: Type: text/plain, Size: 12126 bytes --]

Hey Paul,

many thanks for the detailed reply!
Some comments inline.

Am 10.05.21 um 18:57 schrieb Paul Robert Marino:
> hey Oliver,
> I've done similar things over the years, a lot of fun lab experiments
> and found it really it comes down to a couple of things.
> I did some POC testing with contrackd and some experimental code
> around trunking across multiple firewalls with a sprinkling of
> virtualization.
> There were a few scenarios I tried, some involving OpenVSwitch
> (because I was experimenting with SPBM) and some not, with contrackd
> similarly configured.
> All the scenarios were interesting but they all had relatively rare on
> slow (<1Gbs) network issues that grew exponentially in frequency on
> higher speed networks (>1Gbs) with latency in contrackd syncing.

Indeed, we'd strive for ~20 Gb/s in our case, so this experience surely is important to hear about.

> What i found is the best scenario was to use Quagga for dynamic
> routing to load balance the traffic between the firewall IP's,
> keepalived to load handle IP failover, and contrackd (in a similar
> configuration to the one you described) to keep the states in sync
> there are a few pitfalls in going down this route caused by bad and or
> outdated documentation for both Quagga and keepalived. I'm also going
> to give you some recommendations about some hardware topology stuff
> you may not think about initially.

I'm still a bit unsure if we are on the same page, but that may just be caused by my limited knowledge of Quagga.
To my understanding, Quagga uses e.g. OSPF and hence can load-balance if there are multiple routes with the same cost.

However, in our case, we'd want to go for active/active firewalls (which of course are also routers).
But that means we have internal machines on one side, which use a single default gateway (per VLAN),
then our active/active firewall, and then the outside world (actually a PtP connection to an upstream router).

Can Quagga help me to actively use both firewalls in a load-balancing and redundant way?
The idea here is that the upstream router has high bandwidth, so using more than one firewall allows us to achieve better throughput,
and with active/active we'd also strive for redundancy (i.e. reduced throughput if one firewall fails).
To my understanding, OSPF / Quagga could do this if the firewalls are placed between routers that also participate in OSPF.
But is there also a way to have the clients directly talk to our firewalls, and the firewalls to a single upstream router (which we don't control)?

A simple drawing may help:

              ____  FW A ____
             /               \
Client(s) --                 --PtP-- upstream router
             \____  FW B ____/

This is why I thought about using xt_cluster and giving both FW A and FW B the very same IP (the default gateway of the clients)
and the very same MAC at the same time, so the switch duplicates the packets, and then FW A accepts some packets and FW B the remaining ones
via filtering with xt_cluster.

Can Quagga do something in this picture, or simplify this picture?
The upstream router also sends all incoming packets to a single IP in the PtP network, i.e. the firewall nodes need to show up as "one converged system"
to both the clients on one side and the upstream router on the other side.

> I will start with Quagga because the bad documentation part is easy to cover.
> in the Quagga documentation they recommend that you put a routable IP
> on a loopback interface and attach Quagga the daemon for the dynamic
> routing service of your choice to it, That works fine on BSD and old
> versions of Linux from 20 years ago but any thing running a Linux
> kernel version of 2.4 or higher will not allow it unless you change
> setting in /etc/sysctrl.conf and the Quagga documentation tells you to
> make those changes. DO NOT DO WHAT THEY SAY, its wrong and dangerous.
> Instead create a "dummy" interface with a routable IP for this
> purpose. a dummy interface is a special kind of interface meant for
> exactly the scenario described and works well without compromising the
> security of your firewall.

Thanks for this helpful advice!
Even though I am not sure yet Quagga will help me out in this picture,
I am now already convinced we will have a situation in which Quagga will help us out.
So this is noted down for future use :-).

> Keepalived
> the main error in keepalived's documentation is is most of the
> documentation and howto's you will find about it on the web are based
> on a 15 year old howto which had a fundamental mistake in how VRRP
> works, and what the "state"  flag actually does because its not
> explained well in the man file. "state" in a "vrrp_instance" should
> always be set to "MASTER" on all nodes and the priority should be used
> to determine which node should be the preferred master. the only time
> you should ever set state to "BACKUP" is if you have a 3rd machine
> that you never want to become the master which you are just using for
> quorum and in that case its priority should also be set to "0"
> (failed) . setting the state to "BACKUP" will seem to work fine until
> you have a failover event when the interface will continually go ip
> and done on the backup node. on the mac address issue keepalived will
> apr ping the subnets its attached to so that's generally not an issue
> but I would recommend using vmac's (virtual mac addresses) assuming
> the kernel for your distro and your network cards support it because
> that way it just looks to the switch like it changed a port due to
> some physical topology change and switches usually handle that very
> gracefully, but don't always handle the mac address change for IP
> addresses as quickly.
> I also recommend reading the RFC's on VRRP particularly the parts that
> explain how the elections and priorities work, they are a quick and
> easy read and will really give you a good idea of how to configure
> keepalived properly to achieve the failover and recovery behavior you
> want.

See above on the virtual MACs — if the clients should use both firewalls at the same time,
I think I'd need a single MAC for both, so the clients only see a single default gateway.
In a more classic setup, we've used pcs (pacemaker and corosync) to successfully migrate virtual IPs and MAC addresses.
It has worked quite reliably (using Kronosnet for communication).
But we've also used Keepalived some years ago successfully :-).

> On the hardware topology
> I recommend using dedicated interfaces for contrackd, really you don't
> need anything faster than 100Mbps even if the data interfaces are
> 100Gbps but i usually use 1 Gbps interfaces for this. they can be on
> their own dedicated switches or crossover interfaces. the main concern
> here is securely handling a large number of tiny packets so having
> dedicated network card buffers to handle microburst  is useful and if
> you can avoid latency from a switch that's trying to be too smart for
> its own good that's for the best.

Indeed, we have a 1 Gb/s crossover link, and use a 1 Gb/s connection through a switch in case that should ever fail for some reason —
we use these links both for conntrackd and for Kronosnet communication by corosync.

> For keepalived use dedicated VLAN's on each physical interface to
> handle the heartbeats and group the VRRP interfaces. to insure the
> failovers of the IP's on both sides are handled correctly.
> If you only have 2 firewalls I recommend using a an additional device
> on each side for quorum in a backup/failed mode as described above.
> Assuming a 1 second or greater interval the device could be something
> as simple as a Raspberry PI it really doesn't need to be anything
> powerful because its just adding a heartbeat to the cluster, but for
> sub second intervals you may need something more powerful because sub
> second intervals can eat a surprising amount of CPU.

We currently went without an external third party and let corosync/pacemaker use a STONITH device to explicitly kill the other node
and establish a defined state if heartbeats get lost. We might think about a third machine at some point to get an actual quorum, indeed.

Cheers and thanks again,
	Oliver

> 
> 
> On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
> <freyermuth@physik.uni-bonn.de> wrote:
>>
>> Dear netfilter experts,
>>
>> we are trying to setup an active/active firewall, making use of "xt_cluster".
>> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.
>>
>> So we set rules like:
>>
>>    iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>>    iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
>>
>> Ideally, it we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
>>
>> Basic tests show that this works as expected, but the details get messy.
>>
>> 1. Certainly, conntrackd is needed to synchronize connection states.
>>      But is it always "fast enough"?
>>      xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
>>      but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
>>      We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additonally set "PollSecs 15" since without that it seems
>>      only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
>>      Maybe another way which e.g. may use XOR(src,dst) might work around tight synchronization requirements, or is it possible to always uses the "internal" source IP?
>>      Is anybody doing that with a custom BPF?
>>
>> 2. How to do failover in such cases?
>>      For failover we'd need to change these rules (if one node fails, the total-nodes will change).
>>      As an alternative, I found [1] which states multiple rules can be used and enabled / disabled,
>>      but does somebody know of a cleaner (and easier to read) way, also not costing extra performance?
>>
>> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
>>      so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
>>
>> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
>>      are rather used to migrate services, but not to communicate node ids and number of total active nodes.
>>      They can probably be tricked into doing that somehow, but they are not designed this way.
>>      TIPC may be something to use here, but I found nothing "ready to use".
>>
>> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
>> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
>>
>> Cheers and thanks in advance,
>>          Oliver
>>
>> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
>>
>> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
>> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/
>>
>> --
>> Oliver Freyermuth
>> Universität Bonn
>> Physikalisches Institut, Raum 1.047
>> Nußallee 12
>> 53115 Bonn
>> --
>> Tel.: +49 228 73 2367
>> Fax:  +49 228 73 7869
>> --
>>


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]


* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-09 17:52 Running an active/active firewall/router (xt_cluster?) Oliver Freyermuth
  2021-05-10 16:57 ` Paul Robert Marino
@ 2021-05-10 22:19 ` Pablo Neira Ayuso
  2021-05-10 22:58   ` Oliver Freyermuth
  1 sibling, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2021-05-10 22:19 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: netfilter

[-- Attachment #1: Type: text/plain, Size: 3862 bytes --]

Hi,

On Sun, May 09, 2021 at 07:52:27PM +0200, Oliver Freyermuth wrote:
> Dear netfilter experts,
> 
> we are trying to setup an active/active firewall, making use of
> "xt_cluster".  We can configure the switch to act like a hub, i.e.
> both machines can share the same MAC and IP and get the same packets
> without additional ARPtables tricks.
> 
> So we set rules like:
> 
>  iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>  iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP

I'm attaching an old script to set up active-active that I remember having
used some time ago; I never found the time to upstream this.

> Ideally, it we'd love to have the possibility to scale this to more
> than two nodes, but let's stay with two for now.

IIRC, up to two nodes should be easy with the existing codebase. To
support more than 2 nodes, conntrackd needs to be extended, but it
should be doable.

> Basic tests show that this works as expected, but the details get messy.
> 
> 1. Certainly, conntrackd is needed to synchronize connection states.
>    But is it always "fast enough"?  xt_cluster seems to match by the
>    src_ip of the original direction of the flow[0] (if I read the code
>    correctly), but what happens if the reply to an outgoing packet
>    arrives at both firewalls before state is synchronized?

You can avoid this by setting DisableExternalCache to off. Then, in
case one of your firewall nodes goes down, update the cluster rules and
inject the entries (via keepalived, or your HA daemon of choice).

The recommended configuration is DisableExternalCache off, with your
HA daemon properly configured to assist conntrackd. Then, the conntrack
entries in the "external cache" of conntrackd are added to the kernel
when needed.
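
The glue is usually a small hook script that the HA daemon calls on state transitions, roughly along the lines of the primary-backup.sh shipped with conntrackd (sketch from memory):

  #!/bin/sh
  # called by the HA daemon with "primary", "backup" or "fail"
  case "$1" in
  primary)
      conntrackd -c   # commit the external cache into the kernel table
      conntrackd -f   # flush internal and external caches
      conntrackd -R   # resync the internal cache with the kernel table
      conntrackd -B   # send a bulk update to the other node
      ;;
  backup)
      conntrackd -t   # shorten kernel conntrack timers
      conntrackd -n   # request a resync from the other node
      ;;
  fail)
      conntrackd -t
      ;;
  esac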

>    We are currently using conntrackd in FTFW mode with a direct
>    link, set "DisableExternalCache", and additonally set "PollSecs
>    15" since without that it seems only new and destroyed
>    connections are synced, but lifetime updates for existing
>    connections do not propagate without polling.

No need to set PollSecs; polling should be disabled. Did you enable
event filtering? You should synchronize update events too. Could you
post your configuration file?

[...]
> 2. How to do failover in such cases?
>    For failover we'd need to change these rules (if one node fails,
>    the total-nodes will change).  As an alternative, I found [1]
>    which states multiple rules can be used and enabled / disabled,
>    but does somebody know of a cleaner (and easier to read) way,
>    also not costing extra performance?

If you use iptables, you'll have to update the rules on failure as you
describe. What performance cost are you referring to?

> 3. We have several internal networks, which need to talk to each
>    other (partially with firewall rules and NATting), so we'd also need
>    similar rules there, complicating things more. That's why a cleaner
>    way would be very welcome :-).

As for a cleaner way, it should be possible to simplify this setup
with nftables.
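
Something along these lines should work, using nft's jhash expression in place of xt_cluster (untested sketch, syntax from memory):

  table ip mangle {
      chain preroute {
          type filter hook prerouting priority -150;
          # drop the flows whose source does not hash to this node (node 0 of 2)
          iifname "external_interface" jhash ip saddr mod 2 seed 0xdeadbeef != 0 drop
      }
  }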

> 4. Another point is how to actually perform the failover. Classical
>    cluster suites (corosync + pacemaker) are rather used to migrate
>    services, but not to communicate node ids and number of total active
>    nodes.  They can probably be tricked into doing that somehow, but
>    they are not designed this way.  TIPC may be something to use here,
>    but I found nothing "ready to use".

I have used keepalived in the past with very simple configuration
files, and used its shell script API to interact with conntrackd.
I have not spent much time on corosync/pacemaker so far.
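
The keepalived side is just a few notify hooks in the vrrp_instance that call the conntrackd script above on state changes, e.g. (sketch, the path is only an example):

  # inside the vrrp_instance block:
  notify_master "/etc/conntrackd/primary-backup.sh primary"
  notify_backup "/etc/conntrackd/primary-backup.sh backup"
  notify_fault  "/etc/conntrackd/primary-backup.sh fail"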

[...]

[-- Attachment #2: cluster-node1.sh --]
[-- Type: application/x-sh, Size: 5890 bytes --]


* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-10 21:55   ` Oliver Freyermuth
@ 2021-05-10 22:55     ` Paul Robert Marino
  2021-05-10 23:21       ` Oliver Freyermuth
  0 siblings, 1 reply; 11+ messages in thread
From: Paul Robert Marino @ 2021-05-10 22:55 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: netfilter

I'm adding replies to your replies inline below

On Mon, May 10, 2021, 5:55 PM Oliver Freyermuth
<freyermuth@physik.uni-bonn.de> wrote:
>
> Hey Paul,
>
> many thanks for the detailed reply!
> Some comments inline.
>
> Am 10.05.21 um 18:57 schrieb Paul Robert Marino:
> > hey Oliver,
> > I've done similar things over the years, a lot of fun lab experiments
> > and found it really it comes down to a couple of things.
> > I did some POC testing with contrackd and some experimental code
> > around trunking across multiple firewalls with a sprinkling of
> > virtualization.
> > There were a few scenarios I tried, some involving OpenVSwitch
> > (because I was experimenting with SPBM) and some not, with contrackd
> > similarly configured.
> > All the scenarios were interesting but they all had relatively rare on
> > slow (<1Gbs) network issues that grew exponentially in frequency on
> > higher speed networks (>1Gbs) with latency in contrackd syncing.
>
> Indeed, we'd strive for ~20 Gb/s in our case, so this experience surely is important to hear about.
>
> > What i found is the best scenario was to use Quagga for dynamic
> > routing to load balance the traffic between the firewall IP's,
> > keepalived to load handle IP failover, and contrackd (in a similar
> > configuration to the one you described) to keep the states in sync
> > there are a few pitfalls in going down this route caused by bad and or
> > outdated documentation for both Quagga and keepalived. I'm also going
> > to give you some recommendations about some hardware topology stuff
> > you may not think about initially.
>
> I'm still a bit unsure if we are on the same page, but that may just be caused by my limited knowledge of Quagga.
> To my understanding, Quagga uses e.g. OSPF and hence can, if the routes have the same path, load-balance.
>
> However, in our case, we'd want to go for active/active firewalls (which of course are also routers).
> But that means we have internal machines on one side, which use a single default gateway (per VLAN),
> then our active/active firewall, and then the outside world (actually a PtP connection to an upstream router).
>
> Can Quagga help me to actively use both firewalls in a load-balancing and redundant way?
> The idea here is that the upstream router has high bandwidth, so using more than one firewall allows to achieve better throughput,
> and with active/active we'd also strive for redundancy (i.e. reduced throughput if one firewall fails).
> To my understanding, OSPF / Quagga could do this if the firewalls are placed between routers also joining via OSPF.
> But is there also a way to have the clients directly talk to our firewalls, and the firewalls to a single upstream router (which we don't control)?
>
> A simple drawing may help:
>
>               ____  FW A ____
>              /               \
> Client(s) --                 --PtP-- upstream router
>              \____  FW B ____/
>
> This is why I thought about using xt_cluster and giving both FW A and FW B the very same IP (the default gateway of the clients)
> and the very same MAC at the same time, so the switch duplicates the packets, and then FW A accepts some packets and FW B the remaining ones
> via filtering with xt_cluster.
>
> Can Quagga do something in this picture, or simplify this picture?
> The upstream router also sends all incoming packets to a single IP in the PtP network, i.e. the firewall nodes need to show up as "one converged system"
> to both the clients on one side and the upstream router on the other side.


I understand what you are shooting for, but it's dangerous at those
data rates and not achievable with stock existing software.
I did write some POC code years ago for a previous employer, but
determined it was too dangerous to put into production without some
massive kernel changes, such as using something like RDMA over
dedicated high-speed interfaces or linking the systems over the PCI
Express buses to sync the states instead of using conntrackd.

So load balancing is a better choice in this case, and many mid- to
higher-end managed switches that have routing built in can do OSPF.
I've seen many stackable switches that can do it. By the way, Quagga
supports several other dynamic routing protocols, not just OSPF.
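
If you go the Quagga route, the OSPF side is only a few lines per node, something like this (sketch, addresses made up):

  ! /etc/quagga/ospfd.conf (fragment)
  router ospf
   ospf router-id 192.0.2.1
   network 192.0.2.0/24 area 0.0.0.0
   network 198.51.100.0/30 area 0.0.0.0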

The safest and easiest option for you would be to use a 100 Gb/s fibre
connection instead, possibly with direct-attach cables if you want to
save on optics, and do primary/secondary failover.



> > I will start with Quagga because the bad documentation part is easy to cover.
> > in the Quagga documentation they recommend that you put a routable IP
> > on a loopback interface and attach Quagga the daemon for the dynamic
> > routing service of your choice to it, That works fine on BSD and old
> > versions of Linux from 20 years ago but any thing running a Linux
> > kernel version of 2.4 or higher will not allow it unless you change
> > setting in /etc/sysctrl.conf and the Quagga documentation tells you to
> > make those changes. DO NOT DO WHAT THEY SAY, its wrong and dangerous.
> > Instead create a "dummy" interface with a routable IP for this
> > purpose. a dummy interface is a special kind of interface meant for
> > exactly the scenario described and works well without compromising the
> > security of your firewall.
>
> Thanks for this helpful advice!
> Even though I am not sure yet Quagga will help me out in this picture,
> I am now already convinced we will have a situation in which Quagga will help us out.
> So this is noted down for future use :-).
>
> > Keepalived
> > the main error in keepalived's documentation is is most of the
> > documentation and howto's you will find about it on the web are based
> > on a 15 year old howto which had a fundamental mistake in how VRRP
> > works, and what the "state"  flag actually does because its not
> > explained well in the man file. "state" in a "vrrp_instance" should
> > always be set to "MASTER" on all nodes and the priority should be used
> > to determine which node should be the preferred master. the only time
> > you should ever set state to "BACKUP" is if you have a 3rd machine
> > that you never want to become the master which you are just using for
> > quorum and in that case its priority should also be set to "0"
> > (failed) . setting the state to "BACKUP" will seem to work fine until
> > you have a failover event when the interface will continually go ip
> > and done on the backup node. on the mac address issue keepalived will
> > apr ping the subnets its attached to so that's generally not an issue
> > but I would recommend using vmac's (virtual mac addresses) assuming
> > the kernel for your distro and your network cards support it because
> > that way it just looks to the switch like it changed a port due to
> > some physical topology change and switches usually handle that very
> > gracefully, but don't always handle the mac address change for IP
> > addresses as quickly.
> > I also recommend reading the RFC's on VRRP particularly the parts that
> > explain how the elections and priorities work, they are a quick and
> > easy read and will really give you a good idea of how to configure
> > keepalived properly to achieve the failover and recovery behavior you
> > want.
>
> See above on the virtual MACs — if the clients should use both firewalls at the same time,
> I think I'd need a single MAC for both, so the clients only see a single default gateway.
> In a more classic setup, we've used pcs (pacemaker and corosync) to successfully migrate virtual IPs and MAC addresses.
> It has worked quite reliable (using Kronosnet for communication).
> But we've also used Keepalived some years ago successfully :-).
>
> > On the hardware topology
> > I recommend using dedicated interfaces for contrackd, really you don't
> > need anything faster than 100Mbps even if the data interfaces are
> > 100Gbps but i usually use 1 Gbps interfaces for this. they can be on
> > their own dedicated switches or crossover interfaces. the main concern
> > here is securely handling a large number of tiny packets so having
> > dedicated network card buffers to handle microburst  is useful and if
> > you can avoid latency from a switch that's trying to be too smart for
> > its own good that's for the best.
>
> Indeed, we have 1 Gb/s crossover link, and use a 1 Gb/s connection through a switch in case this would ever fail for some reason —
> we use these links both for conntrackd and for Kronosnet communication by corosync.
>
> > For keepalived use dedicated VLAN's on each physical interface to
> > handle the heartbeats and group the VRRP interfaces. to insure the
> > failovers of the IP's on both sides are handled correctly.
> > If you only have 2 firewalls I recommend using a an additional device
> > on each side for quorum in a backup/failed mode as described above.
> > Assuming a 1 second or greater interval the device could be something
> > as simple as a Raspberry PI it really doesn't need to be anything
> > powerful because its just adding a heartbeat to the cluster, but for
> > sub second intervals you may need something more powerful because sub
> > second intervals can eat a surprising amount of CPU.
>
> We currently went without an external third party and let corosync/pacemaker use a STONITH device to explicitly kill the other node
> and establish a defined state if heartbeats get lost. We might think about a third machine at some point to get an actual quorum, indeed.


I get why you might think to use corosync/pacemaker for this if you
weren't familiar with keepalived and LVS in the kernel, but it's
hammering a square peg into a round hole when you have a perfectly
shaped and sized peg available to you that's actually been around a
lot longer and works a lot more predictably, faster, and more reliably
by leveraging parts of the kernel's network stack designed specifically
for this use case. I've done explicit kills of the other device via
cross-connected hardware watchdog devices with keepalived before, and it
was easy.
By the way, if you don't know what LVS is: it's the kernel's built-in
layer 3 network load balancer stack that was designed with these kinds
of failover scenarios in mind; keepalived is just a wrapper around LVS
that adds VRRP-based heartbeating, hooks that allow you to call
external scripts for actions based on heartbeat state change events,
and additional watchdog scripts which can also trigger state changes.
To be clear, I wouldn't use keepalived to handle process master/slave
failovers; I would use corosync and pacemaker, or in some cases
Clusterd, for that, because they are usually the right tool for the job.
But for firewall and/or network load balancer failover I would always
use keepalived, because it's the right tool for that job.


>
> Cheers and thanks again,
>         Oliver
>
> >
> >
> > On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
> > <freyermuth@physik.uni-bonn.de> wrote:
> >>
> >> Dear netfilter experts,
> >>
> >> we are trying to setup an active/active firewall, making use of "xt_cluster".
> >> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.
> >>
> >> So we set rules like:
> >>
> >>    iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
> >>    iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
> >>
> >> Ideally, it we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
> >>
> >> Basic tests show that this works as expected, but the details get messy.
> >>
> >> 1. Certainly, conntrackd is needed to synchronize connection states.
> >>      But is it always "fast enough"?
> >>      xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
> >>      but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
> >>      We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additonally set "PollSecs 15" since without that it seems
> >>      only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
> >>      Maybe another way which e.g. may use XOR(src,dst) might work around tight synchronization requirements, or is it possible to always uses the "internal" source IP?
> >>      Is anybody doing that with a custom BPF?
> >>
> >> 2. How to do failover in such cases?
> >>      For failover we'd need to change these rules (if one node fails, the total-nodes will change).
> >>      As an alternative, I found [1] which states multiple rules can be used and enabled / disabled,
> >>      but does somebody know of a cleaner (and easier to read) way, also not costing extra performance?
> >>
> >> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
> >>      so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
> >>
> >> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
> >>      are rather used to migrate services, but not to communicate node ids and number of total active nodes.
> >>      They can probably be tricked into doing that somehow, but they are not designed this way.
> >>      TIPC may be something to use here, but I found nothing "ready to use".
> >>
> >> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
> >> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
> >>
> >> Cheers and thanks in advance,
> >>          Oliver
> >>
> >> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
> >>
> >> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
> >> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/
> >>
> >> --
> >> Oliver Freyermuth
> >> Universität Bonn
> >> Physikalisches Institut, Raum 1.047
> >> Nußallee 12
> >> 53115 Bonn
> >> --
> >> Tel.: +49 228 73 2367
> >> Fax:  +49 228 73 7869
> >> --
> >>
>
>
> --
> Oliver Freyermuth
> Universität Bonn
> Physikalisches Institut, Raum 1.047
> Nußallee 12
> 53115 Bonn
> --
> Tel.: +49 228 73 2367
> Fax:  +49 228 73 7869
> --
>


* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-10 22:19 ` Pablo Neira Ayuso
@ 2021-05-10 22:58   ` Oliver Freyermuth
  2021-05-11  9:28     ` Oliver Freyermuth
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-10 22:58 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter


[-- Attachment #1.1: Type: text/plain, Size: 6587 bytes --]

Hi,

many thanks for this elaborate reply!

Am 11.05.21 um 00:19 schrieb Pablo Neira Ayuso:
> Hi,
> 
> On Sun, May 09, 2021 at 07:52:27PM +0200, Oliver Freyermuth wrote:
>> Dear netfilter experts,
>>
>> we are trying to setup an active/active firewall, making use of
>> "xt_cluster".  We can configure the switch to act like a hub, i.e.
>> both machines can share the same MAC and IP and get the same packets
>> without additional ARPtables tricks.
>>
>> So we set rules like:
>>
>>   iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>>   iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
> 
> I'm attaching an old script to set up active-active I remember to have
> used time ago, I never found the time to upstream this.

this is really helpful indeed.
While we use Shorewall (which simplifies many things, but has no abstraction for xt_cluster as far as I am aware),
it helps to see all rules written up together to translate them for Shorewall, and also the debugging rules are very helpful.
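
One way to translate them would probably be to drop the raw xt_cluster rules into one of Shorewall's extension scripts, e.g. /etc/shorewall/started (untested sketch):

  # /etc/shorewall/started -- re-add the xt_cluster rules after a (re)start
  iptables -I PREROUTING -t mangle -i external_interface -m cluster \
      --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef \
      -j MARK --set-mark 0xffff
  iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP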

> 
>> Ideally, it we'd love to have the possibility to scale this to more
>> than two nodes, but let's stay with two for now.
> 
> IIRC, up to two nodes should be easy with the existing codebase. To
> support more than 2 nodes, conntrackd needs to be extended, but it
> should be doable.
> 
>> Basic tests show that this works as expected, but the details get messy.
>>
>> 1. Certainly, conntrackd is needed to synchronize connection states.
>>     But is it always "fast enough"?  xt_cluster seems to match by the
>>     src_ip of the original direction of the flow[0] (if I read the code
>>     correctly), but what happens if the reply to an outgoing packet
>>     arrives at both firewalls before state is synchronized?
> 
> You can avoid this by setting DisableExternalCache to off. Then, in
> case one of your firewall node goes off, update the cluster rules and
> inject the entries (via keepalived, or your HA daemon of choice).
> 
> Recommended configuration is DisableExternalCache off and properly
> configure your HA daemon to assist conntrackd. Then, the conntrack
> entries in the "external cache" of conntrackd are added to the kernel
> when needed.

You caused a classic "facepalming" moment. Of course, that will solve (1)
completely. My initial thinking when disabling the external cache
dates from before I understood how xt_cluster works and before I found that it uses the direction
of the flow, and then it just escaped my mind.
Thanks for clearing this up! :-)

> 
>>     We are currently using conntrackd in FTFW mode with a direct
>>     link, set "DisableExternalCache", and additonally set "PollSecs
>>     15" since without that it seems only new and destroyed
>>     connections are synced, but lifetime updates for existing
>>     connections do not propagate without polling.
> 
> No need to set on PollSecs. Polling should be disabled. Did you enable
> event filtering? You should synchronize receive update too. Could you
> post your configuration file?

Sure, it's attached — I'm doing event filtering, but only by address and protocol,
not by flow state, so I thought it would be harmless in this regard.
For my test, I just sent a continuous stream of ICMP echo requests through the node,
and the flow itself was synced fine, but then the lifetime was not updated on the partner node unless polling was active,
and finally the flow was removed on the partner machine (lifetime expired) while it was being kept alive by updates
on the primary node.

This was with "DisableExternalCache on", on a CentOS 8.2 node, i.e.:
   Kernel 4.18.0-193.19.1.el8_2.x86_64
   conntrackd v1.4.4

> 
> [...]
>> 2. How to do failover in such cases?
>>     For failover we'd need to change these rules (if one node fails,
>>     the total-nodes will change).  As an alternative, I found [1]
>>     which states multiple rules can be used and enabled / disabled,
>>     but does somebody know of a cleaner (and easier to read) way,
>>     also not costing extra performance?
> 
> If you use iptables, you'll have to update the rules on failure as you
> describe. What performance cost are you refering to?

This was based on your comment here:
  https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/

But probably, this is indeed premature thinking on my end —
with two firewalls, having two rules after failover should have even less impact than what you measured there.
I still think something like the /proc interface you described there would be cleaner, but I also don't know of a failover daemon
which could make use of it.

>> 3. We have several internal networks, which need to talk to each
>>     other (partially with firewall rules and NATting), so we'd also need
>>     similar rules there, complicating things more. That's why a cleaner
>>     way would be very welcome :-).
> 
> Cleaner way, it should be possible to simplify this setup with
> nftables.

Since we currently use Shorewall as simplification layer (which eases many things by its abstraction,
but still uses iptables behind the scenes), it's probably best for sanity not to mix here.
So the less "clean" way is likely the easier one for now.

>> 4. Another point is how to actually perform the failover. Classical
>>     cluster suites (corosync + pacemaker) are rather used to migrate
>>     services, but not to communicate node ids and number of total active
>>     nodes.  They can probably be tricked into doing that somehow, but
>>     they are not designed this way.  TIPC may be something to use here,
>>     but I found nothing "ready to use".
> 
> I have used keepalived in the past with very simple configuration
> files, and use their shell script API to interact with conntrackd.
> I did not spend much time on corosync/pacemaker so far.

I was mostly thinking about the cluster rules —
I'd love to have a daemon which could adjust cluster-total-nodes and cluster-local-node,
instead of having two rules on one firewall when the other fails.

I think I can make the latter work with pacemaker/corosync, and also have it support conntrackd;
it might be fiddly, but should be doable.
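
(Such a daemon would essentially just have to do something like the following when the peer disappears; untested sketch, assuming the cluster rule is the first one in the mangle PREROUTING chain:

   # surviving node takes over all traffic: become node 1 of 1
   iptables -R PREROUTING 1 -t mangle -i external_interface -m cluster \
       --cluster-total-nodes 1 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef \
       -j MARK --set-mark 0xffff

and the reverse once the peer returns.)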

Many thanks for the elaborate answer,
	Oliver

-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--

[-- Attachment #1.2: conntrackd.conf --]
[-- Type: text/plain, Size: 14825 bytes --]


# See also: http://conntrack-tools.netfilter.org/support.html
# 
# There are 3 different modes of running conntrackd: "alarm", "notrack" and "ftfw"
#
# The default package ships with a FTFW configuration, see /usr/share/doc/conntrackd*
# for example configurations for other modes.


#
# Synchronizer settings
#
Sync {
	Mode FTFW {
		#
		# Size of the resend queue (in objects). This is the maximum
		# number of objects that can be stored waiting to be confirmed
		# via acknowledgment. If you keep this value low, the daemon
		# will have less chances to recover state-changes under message
		# omission. On the other hand, if you keep this value high,
		# the daemon will consume more memory to store dead objects.
		# Default is 131072 objects.
		#
		# ResendQueueSize 131072

		#
		# This parameter allows you to set an initial fixed timeout
		# for the committed entries when this node goes from backup
		# to primary. This mechanism provides a way to purge entries
		# that were not recovered appropriately after the specified
		# fixed timeout. If you set a low value, TCP entries in
		# Established states with no traffic may hang. For example,
		# an SSH connection without KeepAlive enabled. If not set,
		# the daemon uses an approximate timeout value calculation
		# mechanism. By default, this option is not set.
		#
		# CommitTimeout 180

		#
		# If the firewall replica goes from primary to backup,
		# the conntrackd -t command is invoked in the script. 
		# This command schedules a flush of the table in N seconds.
		# This is useful to purge the connection tracking table of
		# zombie entries and avoid clashes with old entries if you
		# trigger several consecutive hand-overs. Default is 60 seconds.
		#
		# PurgeTimeout 60

		# Set the acknowledgement window size. If you decrease this
		# value, the number of acknowledgments increases. More
		# acknowledgments means more overhead as conntrackd has to
		# handle more control messages. On the other hand, if you
		# increase this value, the resend queue gets more populated.
		# This results in more overhead in the queue releasing.
		# The following value is based on some practical experiments
		# measuring the cycles spent by the acknowledgment handling
		# with oprofile. If not set, default window size is 300.
		#
		# ACKWindowSize 300

		#
		# This clause allows you to disable the external cache. Thus,
		# the state entries are directly injected into the kernel
		# conntrack table. As a result, you save memory in user-space
		# but you consume slots in the kernel conntrack table for
		# backup state entries. Moreover, disabling the external cache
		# means more CPU consumption. You need a Linux kernel
		# >= 2.6.29 to use this feature. By default, this clause is
		# set off. If you are installing conntrackd for first time,
		# please read the user manual and I encourage you to consider
		# using the fail-over scripts instead of enabling this option!
		#
		DisableExternalCache On
	}

	#
	# Multicast IP and interface where messages are
	# broadcasted (dedicated link). IMPORTANT: Make sure
	# that iptables accepts traffic for destination
	# 225.0.0.50, eg:
	#
	#	iptables -I INPUT -d 225.0.0.50 -j ACCEPT
	#	iptables -I OUTPUT -d 225.0.0.50 -j ACCEPT
	#
	#Multicast {
		# 
		# Multicast address: The address that you use as destination
		# in the synchronization messages. You do not have to add
		# this IP to any of your existing interfaces. If any doubt,
		# do not modify this value.
		#
	#	IPv4_address 225.0.0.50

		#
		# The multicast group that identifies the cluster. If any
		# doubt, do not modify this value.
		#
	#	Group 3780

		#
		# IP address of the interface that you are going to use to
		# send the synchronization messages. Remember that you must
		# use a dedicated link for the synchronization messages.
		#
	#	IPv4_interface 192.168.100.100

		#
		# The name of the interface that you are going to use to
		# send the synchronization messages.
		#
	#	Interface eth2

		# The multicast sender uses a buffer to enqueue the packets
		# that are going to be transmitted. The default size of this
		# socket buffer is available at /proc/sys/net/core/wmem_default.
		# This value determines the chances to have an overrun in the
		# sender queue. An overrun results in packet loss, thus losing
		# state information that would have to be retransmitted. If you
		# notice some packet loss, you may want to increase the size
		# of the sender buffer. The default size is usually around
		# ~100 KBytes which is fairly small for busy firewalls.
		#
	#	SndSocketBuffer 1249280

		# The multicast receiver uses a buffer to enqueue the packets
		# that the socket is pending to handle. The default size of this
		# socket buffer is available at /proc/sys/net/core/rmem_default.
		# This value determines the chances to have an overrun in the
		# receiver queue. An overrun results in packet loss, thus losing
		# state information that would have to be retransmitted. If you
		# notice some packet loss, you may want to increase the size of
		# the receiver buffer. The default size is usually around
		# ~100 KBytes which is fairly small for busy firewalls.
		#
	#	RcvSocketBuffer 1249280

		# 
		# Enable/Disable message checksumming. This is a good
		# property to achieve fault-tolerance. In case of doubt, do
		# not modify this value.
		#
	#	Checksum on
	#}
	#
	# You can specify more than one dedicated link. Thus, if one dedicated
	# link fails, conntrackd can fail-over to another. Note that adding
	# more than one dedicated link does not mean that state-updates will
	# be sent to all of them. There is only one active dedicated link at
	# a given moment. The `Default' keyword indicates that this interface
	# will be selected as the initial dedicated link. You can have 
	# up to 4 redundant dedicated links. Note: Use different multicast 
	# groups for every redundant link.
	#
	# Multicast Default {
	#	IPv4_address 225.0.0.51
	#	Group 3781
	#	IPv4_interface 192.168.100.101
	#	Interface eth3
	#	# SndSocketBuffer 1249280
	#	# RcvSocketBuffer 1249280
	#	Checksum on
	# }

	#
	# You can use Unicast UDP instead of Multicast to propagate events.
	# Note that you cannot use unicast UDP and Multicast at the same
	# time; you can only select one.
	# 
	#UDP {
		# 
		# UDP address that this firewall uses to listen to events.
		#
		# IPv4_address 192.168.2.100
		#
		# or you may want to use an IPv6 address:
		#
		# IPv6_address fe80::215:58ff:fe28:5a27

		#
		# Destination UDP address that receives events, i.e. the other
		# firewall's dedicated link address.
		#
		# IPv4_Destination_Address 192.168.2.101
		#
		# or you may want to use an IPv6 address:
		#
		# IPv6_Destination_Address fe80::2d0:59ff:fe2a:775c

		#
		# UDP port used
		#
		# Port 3780

		#
		# The name of the interface that you are going to use to
		# send the synchronization messages.
		#
		# Interface eth2

		# 
		# The sender socket buffer size
		#
		# SndSocketBuffer 1249280

		#
		# The receiver socket buffer size
		#
		# RcvSocketBuffer 1249280

		# 
		# Enable/Disable message checksumming. 
		#
		# Checksum on
	# }

	# main connection via crossover cable
	UDP Default {
		IPv4_address 192.168.1.1
		IPv4_Destination_Address 192.168.1.2
		Port 3780
		Interface eno1
		SndSocketBuffer 24985600
		RcvSocketBuffer 24985600
		Checksum on
	}
	# backup via virt network
	UDP {
		IPv4_address 10.160.5.204
		IPv4_Destination_Address 10.160.5.205
		Port 3780
		Interface eno2
		SndSocketBuffer 24985600
		RcvSocketBuffer 24985600
		Checksum on
	}

	# 
	# Other unsorted options that are related to the synchronization.
	# 
	Options {
		#
		# TCP state-entries have window tracking disabled by default;
		# you can enable it with this option. The default is off.
		# This feature requires a Linux kernel >= 2.6.36.
		#
		# TCPWindowTracking Off
		TCPWindowTracking On

		#ExpectationSync on
		#ExpectationSync {
		#	h.323
		#}
	}
}

#
# General settings
#
General {
	#
	# Set the nice value of the daemon; this value goes from -20
	# (most favourable scheduling) to 19 (least favourable). Using a
	# very low value reduces the chances of losing state-change events.
	# The default is 0, but this example file sets it to the most
	# favourable scheduling, as this is generally a good idea. See man nice(1) for
	# more information.
	#
	Nice -20

	#
	# Select a different scheduler for the daemon; you can choose between
	# RR and FIFO and set the process priority (minimum is 0, maximum is 99).
	# See man sched_setscheduler(2) for more information. Using an RT
	# scheduler reduces the chances of overrunning the Netlink buffer.
	#
	# Scheduler {
	#	Type FIFO
	#	Priority 99
	# }

	#
	# Number of buckets in the cache hashtable. The bigger it is,
	# the closer it gets to O(1) at the cost of consuming more memory.
	# Read some documents about tuning hashtables for further reference.
	#
	HashSize 32768

	#
	# Maximum number of conntracks; it should be double the value of:
	# $ cat /proc/sys/net/netfilter/nf_conntrack_max
	# since the daemon may keep some dead entries cached for possible
	# retransmission during state synchronization.
	#
	HashLimit 131072
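	#
	# (For example, if /proc/sys/net/netfilter/nf_conntrack_max reports
	# 65536, the value of 131072 above (= 2 * 65536) follows that rule;
	# scale HashLimit accordingly if you raise nf_conntrack_max.)
	#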

	#
	# Logfile: on (/var/log/conntrackd.log), off, or a filename
	# Default: off
	#
	LogFile on

	#
	# Syslog: on, off or a facility name (daemon (default) or local0..7)
	# Default: off
	#
	#Syslog on

	#
	# Lockfile
	# 
	LockFile /var/lock/conntrack.lock

	#
	# Unix socket configuration
	#
	UNIX {
		Path /var/run/conntrackd.ctl
		Backlog 20
	}

	#
	# Netlink event socket buffer size. If you do not specify this clause,
	# the default buffer size value in /proc/sys/net/core/rmem_default is
	# used. This default value is usually around 100 Kbytes, which is
	# fairly small for busy firewalls. This leads to event message dropping
	# and high CPU consumption. This example configuration file sets the
	# size to 2 MBytes to avoid this sort of problem.
	#
	NetlinkBufferSize 2097152

	#
	# The daemon doubles the size of the netlink event socket buffer size
	# if it detects netlink event message dropping. This clause sets the
	# maximum buffer size growth that can be reached. This example file
	# sets the size to 8 MBytes.
	#
	NetlinkBufferSizeMaxGrowth 8388608

	#
	# If the daemon detects that Netlink is dropping state-change events,
	# it automatically schedules a resynchronization against the Kernel
	# after 30 seconds (default value). Resynchronizations are expensive
	# in terms of CPU consumption since the daemon has to get the full
	# kernel state-table and purge state-entries that do not exist anymore.
	# Be careful of setting a very small value here. You have the following
	# choices: On (enabled, use default 30 seconds value), Off (disabled)
	# or Value (in seconds, to set a specific amount of time). If not
	# specified, the daemon assumes that this option is enabled.
	#
	# NetlinkOverrunResync On

	#
	# If you want reliable event reporting over Netlink, turn this
	# option on. If you turn this clause on, it is a good idea to turn
	# NetlinkOverrunResync off. This option is off by default and you need
	# a Linux kernel >= 2.6.31.
	#
	# NetlinkEventsReliable Off
	NetlinkEventsReliable On

	# 
	# By default, the daemon receives state updates following an
	# event-driven model. You can modify this behaviour by switching to
	# polling mode with the PollSecs clause. This clause tells conntrackd
	# to dump the states in the kernel every N seconds. With regard to
	# the synchronization mode, polling can only guarantee that
	# long-lived states are recovered. The main advantage of this method
	# is the reduction in state-replication traffic, at the cost of reducing
	# the chances of recovering connections.
	#
	PollSecs 15

	#
	# The daemon prioritizes the handling of state-change events coming
	# from the core. With this clause, you can set the maximum number of
	# state-change events (those coming from kernel-space) that the daemon
	# will handle after which it will handle other events coming from the
	# network or userspace. A low value improves interactivity (in terms of
	# real-time behaviour) at the cost of extra CPU consumption.
	# Default (if not set) is 100.
	#
	# EventIterationLimit 100

	#
	# Event filtering: This clause allows you to filter certain traffic.
	# There are currently three filter-sets: Protocol, Address and
	# State. The filter is attached to an action that can be: Accept or
	# Ignore. Thus, you can define the event filtering policy of the
	# filter-sets in positive or negative logic depending on your needs.
	# You can select whether conntrackd filters the event messages from
	# user-space or kernel-space. The kernel-space event filtering
	# saves some CPU cycles by avoiding the copy of the event message
	# from kernel-space to user-space. Kernel-space event filtering
	# is preferred; however, you require a Linux kernel >= 2.6.29 to
	# filter from kernel-space. If you want to select kernel-space
	# event filtering, use the keyword 'Kernelspace' instead of
	# 'Userspace'.
	#
	Filter From Kernelspace {
		#
		# Accept only certain protocols: You may want to replicate
		# the state of flows depending on their layer 4 protocol.
		#
		Protocol Accept {
			TCP
			SCTP
			DCCP
			UDP
			ICMP		# This requires a Linux kernel >= 2.6.31
			IPv6-ICMP	# This requires a Linux kernel >= 2.6.31
		}

		#
		# Ignore traffic for a certain set of IPs: usually all the
		# IPs assigned to the firewall, since local traffic must be
		# ignored; only forwarded connections are worth replicating.
		# Note that these values depend on the local IPs that are
		# assigned to the firewall.
		#
		Address Ignore {
			IPv4_address 127.0.0.1 # loopback
			IPv4_address 10.160.5.203 # VIP
			IPv4_address 10.160.5.204 # IP FW 1
			IPv4_address 10.160.5.205 # IP FW 2
			IPv4_address 192.168.1.0/24 # Crossover IPs
			IPv6_address ::1 # loopback
			#IPv4_address 192.168.100.100 # dedicated link ip
			#
			# You can also specify networks in format IP/cidr.
			# IPv4_address 192.168.0.0/24
			#
			# You can also specify an IPv6 address
			# IPv6_address ::1
		}

		#
		# Uncomment the lines below if you want to filter by flow state.
		# This option introduces a trade-off in the replication: it
		# reduces CPU consumption at the cost of having lazy backup 
		# firewall replicas. The existing TCP states are: SYN_SENT,
		# SYN_RECV, ESTABLISHED, FIN_WAIT, CLOSE_WAIT, LAST_ACK,
		# TIME_WAIT, CLOSED, LISTEN.
		#
		# State Accept {
		#	ESTABLISHED CLOSED TIME_WAIT CLOSE_WAIT for TCP
		# }
	}
}

[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-10 22:55     ` Paul Robert Marino
@ 2021-05-10 23:21       ` Oliver Freyermuth
       [not found]         ` <CAPJdpdDNmTq_yafDU12w1xz7PUTm4zZr6vt2nGciv=baGYwP1A@mail.gmail.com>
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-10 23:21 UTC (permalink / raw)
  To: Paul Robert Marino; +Cc: netfilter

[-- Attachment #1: Type: text/plain, Size: 16764 bytes --]

Also answering inline.

On 11.05.21 at 00:55, Paul Robert Marino wrote:
> I'm adding replies to your replies inline below
> 
> On Mon, May 10, 2021, 5:55 PM Oliver Freyermuth
> <freyermuth@physik.uni-bonn.de> wrote:
>>
>> Hey Paul,
>>
>> many thanks for the detailed reply!
>> Some comments inline.
>>
>> On 10.05.21 at 18:57, Paul Robert Marino wrote:
>>> hey Oliver,
>>> I've done similar things over the years, a lot of fun lab experiments
>>> and found it really it comes down to a couple of things.
>>> I did some POC testing with contrackd and some experimental code
>>> around trunking across multiple firewalls with a sprinkling of
>>> virtualization.
>>> There were a few scenarios I tried, some involving OpenVSwitch
>>> (because I was experimenting with SPBM) and some not, with contrackd
>>> similarly configured.
>>> All the scenarios were interesting but they all had relatively rare on
>>> slow (<1Gbs) network issues that grew exponentially in frequency on
>>> higher speed networks (>1Gbs) with latency in contrackd syncing.
>>
>> Indeed, we'd strive for ~20 Gb/s in our case, so this experience surely is important to hear about.
>>
>>> What i found is the best scenario was to use Quagga for dynamic
>>> routing to load balance the traffic between the firewall IP's,
>>> keepalived to load handle IP failover, and contrackd (in a similar
>>> configuration to the one you described) to keep the states in sync
>>> there are a few pitfalls in going down this route caused by bad and or
>>> outdated documentation for both Quagga and keepalived. I'm also going
>>> to give you some recommendations about some hardware topology stuff
>>> you may not think about initially.
>>
>> I'm still a bit unsure if we are on the same page, but that may just be caused by my limited knowledge of Quagga.
>> To my understanding, Quagga uses e.g. OSPF and hence can, if the routes have the same path, load-balance.
>>
>> However, in our case, we'd want to go for active/active firewalls (which of course are also routers).
>> But that means we have internal machines on one side, which use a single default gateway (per VLAN),
>> then our active/active firewall, and then the outside world (actually a PtP connection to an upstream router).
>>
>> Can Quagga help me to actively use both firewalls in a load-balancing and redundant way?
>> The idea here is that the upstream router has high bandwidth, so using more than one firewall allows to achieve better throughput,
>> and with active/active we'd also strive for redundancy (i.e. reduced throughput if one firewall fails).
>> To my understanding, OSPF / Quagga could do this if the firewalls are placed between routers also joining via OSPF.
>> But is there also a way to have the clients directly talk to our firewalls, and the firewalls to a single upstream router (which we don't control)?
>>
>> A simple drawing may help:
>>
>>                ____  FW A ____
>>               /               \
>> Client(s) --                 --PtP-- upstream router
>>               \____  FW B ____/
>>
>> This is why I thought about using xt_cluster and giving both FW A and FW B the very same IP (the default gateway of the clients)
>> and the very same MAC at the same time, so the switch duplicates the packets, and then FW A accepts some packets and FW B the remaining ones
>> via filtering with xt_cluster.
>>
>> Can Quagga do something in this picture, or simplify this picture?
>> The upstream router also sends all incoming packets to a single IP in the PtP network, i.e. the firewall nodes need to show up as "one converged system"
>> to both the clients on one side and the upstream router on the other side.
> 
> 
> I understand what you are shooting for but it's dangerous at those
> data rates and not achievable via stock existing software.
> I did write some POC code years ago for a previous employer but
> determined it was too dangerous to put into production without some
> massive kernel changes such as using something like RDMA over
> dedicated high speed interfaces or linking the systems over the PCI
> express busses to sync the states instead of using contrackd.
> 
> So load balancing is a better choice in this case, and many middle to
> higher end managed switches that have routers built in can do OSPF.
> I've seen many stackable switches that can do it. By the way Quagga
> supports several other dynamic routing protocols not just just OSPF.

Thanks, now I understand your answer much better — the classical case of intention getting lost between the lines.
Indeed, this is important experience, many thanks for sharing it!

I was already unsure if with such a solution I could really expect to achieve these data rates,
so this warning is worth its weight in gold.
I'll still play around with this setup in the lab, but testing at scale is also not easy
(for us) in the lab, so again this warning is very useful so we won't take this into production.

The problem which made me think about all this is that we don't have control of the upstream router.
That made me hope for a solution which does not require changes on that end.
But of course we can communicate with the operators
and see if we can find a way to use dynamic routing on that end.

> The safest and easiest option for you would be to use 100Gbs fibre
> connection instead, possibly with direct attach cables if you want to
> save on optics, and do primary secondary failover.

Sadly, the infrastructure further upstream is not yet upgraded to support 100 Gb/s (and will not be in the near future),
otherwise, this surely would have been the easier option.

>>> I will start with Quagga because the bad documentation part is easy to cover.
>>> in the Quagga documentation they recommend that you put a routable IP
>>> on a loopback interface and attach Quagga the daemon for the dynamic
>>> routing service of your choice to it, That works fine on BSD and old
>>> versions of Linux from 20 years ago but any thing running a Linux
>>> kernel version of 2.4 or higher will not allow it unless you change
>>> setting in /etc/sysctrl.conf and the Quagga documentation tells you to
>>> make those changes. DO NOT DO WHAT THEY SAY, its wrong and dangerous.
>>> Instead create a "dummy" interface with a routable IP for this
>>> purpose. a dummy interface is a special kind of interface meant for
>>> exactly the scenario described and works well without compromising the
>>> security of your firewall.
>>
>> Thanks for this helpful advice!
>> Even though I am not sure yet Quagga will help me out in this picture,
>> I am now already convinced we will have a situation in which Quagga will help us out.
>> So this is noted down for future use :-).
>>
>>> Keepalived
>>> the main error in keepalived's documentation is is most of the
>>> documentation and howto's you will find about it on the web are based
>>> on a 15 year old howto which had a fundamental mistake in how VRRP
>>> works, and what the "state"  flag actually does because its not
>>> explained well in the man file. "state" in a "vrrp_instance" should
>>> always be set to "MASTER" on all nodes and the priority should be used
>>> to determine which node should be the preferred master. the only time
>>> you should ever set state to "BACKUP" is if you have a 3rd machine
>>> that you never want to become the master which you are just using for
>>> quorum and in that case its priority should also be set to "0"
>>> (failed) . setting the state to "BACKUP" will seem to work fine until
>>> you have a failover event when the interface will continually go ip
>>> and done on the backup node. on the mac address issue keepalived will
>>> apr ping the subnets its attached to so that's generally not an issue
>>> but I would recommend using vmac's (virtual mac addresses) assuming
>>> the kernel for your distro and your network cards support it because
>>> that way it just looks to the switch like it changed a port due to
>>> some physical topology change and switches usually handle that very
>>> gracefully, but don't always handle the mac address change for IP
>>> addresses as quickly.
>>> I also recommend reading the RFC's on VRRP particularly the parts that
>>> explain how the elections and priorities work, they are a quick and
>>> easy read and will really give you a good idea of how to configure
>>> keepalived properly to achieve the failover and recovery behavior you
>>> want.
>>
>> See above on the virtual MACs — if the clients should use both firewalls at the same time,
>> I think I'd need a single MAC for both, so the clients only see a single default gateway.
>> In a more classic setup, we've used pcs (pacemaker and corosync) to successfully migrate virtual IPs and MAC addresses.
>> It has worked quite reliable (using Kronosnet for communication).
>> But we've also used Keepalived some years ago successfully :-).
>>
>>> On the hardware topology
>>> I recommend using dedicated interfaces for contrackd, really you don't
>>> need anything faster than 100Mbps even if the data interfaces are
>>> 100Gbps but i usually use 1 Gbps interfaces for this. they can be on
>>> their own dedicated switches or crossover interfaces. the main concern
>>> here is securely handling a large number of tiny packets so having
>>> dedicated network card buffers to handle microburst  is useful and if
>>> you can avoid latency from a switch that's trying to be too smart for
>>> its own good that's for the best.
>>
>> Indeed, we have 1 Gb/s crossover link, and use a 1 Gb/s connection through a switch in case this would ever fail for some reason —
>> we use these links both for conntrackd and for Kronosnet communication by corosync.
>>
>>> For keepalived use dedicated VLAN's on each physical interface to
>>> handle the heartbeats and group the VRRP interfaces. to insure the
>>> failovers of the IP's on both sides are handled correctly.
>>> If you only have 2 firewalls I recommend using a an additional device
>>> on each side for quorum in a backup/failed mode as described above.
>>> Assuming a 1 second or greater interval the device could be something
>>> as simple as a Raspberry PI it really doesn't need to be anything
>>> powerful because its just adding a heartbeat to the cluster, but for
>>> sub second intervals you may need something more powerful because sub
>>> second intervals can eat a surprising amount of CPU.
>>
>> We currently went without an external third party and let corosync/pacemaker use a STONITH device to explicitly kill the other node
>> and establish a defined state if heartbeats get lost. We might think about a third machine at some point to get an actual quorum, indeed.
> 
> 
> I get why you might think to use corosync/pacemaker for this if you
> weren't familiar with keepalived and LVS in the kernel,  but it's
> hammering a square peg in a round hole when you have a perfectly
> shaped and sized peg available to you that's actually been around a
> lot longer and works a lot more predictably, faster and more reliably
> by leveraging parts of the kernels network stack designed specifically
> for this use case. I've done explicit kills of the other device via
> cross connected hardware watchdog devices via keepalived before and it
> was easy.
> By the way if you don't know what LVS is it's the kernels builtin
> layer 3 network load balancer stack that was designed with these kind
> of failover scenarios in mind keepalived is just a wrapper around LVS
> that adds VRRP based heartbeating and hooks to allow you to call
> external scripts for actions based on heart beat state change events
> and additional watchdog scripts which can also trigger state changes.
> To be clear i wouldn't use keepalived to handle process master slave
> failovers i would use corosync and pacemaker, or in some cases
> Clusterd for that because they are usually the right tool for the job,
> but for firewall and or network load balancer failover i would always
> use keepalived because its the right tool for that job.
> 

Our main reasoning for corosync/pacemaker was that we've used it for the predecessor setup quite successfully for ~7 years,
while we have only used keepalived in smaller configurations (but it also served us well).
You raise many valid points, so even though pacemaker/corosync has not disappointed us (as of yet), we might indeed reconsider this decision.

Cheers and thanks,
	Oliver

> 
>>
>> Cheers and thanks again,
>>          Oliver
>>
>>>
>>>
>>> On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
>>> <freyermuth@physik.uni-bonn.de> wrote:
>>>>
>>>> Dear netfilter experts,
>>>>
>>>> we are trying to setup an active/active firewall, making use of "xt_cluster".
>>>> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.
>>>>
>>>> So we set rules like:
>>>>
>>>>     iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>>>>     iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
>>>>
>>>> Ideally, it we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
>>>>
>>>> Basic tests show that this works as expected, but the details get messy.
>>>>
>>>> 1. Certainly, conntrackd is needed to synchronize connection states.
>>>>       But is it always "fast enough"?
>>>>       xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
>>>>       but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
>>>>       We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additonally set "PollSecs 15" since without that it seems
>>>>       only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
>>>>       Maybe another way which e.g. may use XOR(src,dst) might work around tight synchronization requirements, or is it possible to always uses the "internal" source IP?
>>>>       Is anybody doing that with a custom BPF?
>>>>
>>>> 2. How to do failover in such cases?
>>>>       For failover we'd need to change these rules (if one node fails, the total-nodes will change).
>>>>       As an alternative, I found [1] which states multiple rules can be used and enabled / disabled,
>>>>       but does somebody know of a cleaner (and easier to read) way, also not costing extra performance?
>>>>
>>>> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
>>>>       so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
>>>>
>>>> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
>>>>       are rather used to migrate services, but not to communicate node ids and number of total active nodes.
>>>>       They can probably be tricked into doing that somehow, but they are not designed this way.
>>>>       TIPC may be something to use here, but I found nothing "ready to use".
>>>>
>>>> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
>>>> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
>>>>
>>>> Cheers and thanks in advance,
>>>>           Oliver
>>>>
>>>> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
>>>>
>>>> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
>>>> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/
>>>>
>>>> --
>>>> Oliver Freyermuth
>>>> Universität Bonn
>>>> Physikalisches Institut, Raum 1.047
>>>> Nußallee 12
>>>> 53115 Bonn
>>>> --
>>>> Tel.: +49 228 73 2367
>>>> Fax:  +49 228 73 7869
>>>> --
>>>>
>>
>>
>> --
>> Oliver Freyermuth
>> Universität Bonn
>> Physikalisches Institut, Raum 1.047
>> Nußallee 12
>> 53115 Bonn
>> --
>> Tel.: +49 228 73 2367
>> Fax:  +49 228 73 7869
>> --
>>


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Running an active/active firewall/router (xt_cluster?)
       [not found]         ` <CAPJdpdDNmTq_yafDU12w1xz7PUTm4zZr6vt2nGciv=baGYwP1A@mail.gmail.com>
@ 2021-05-11  9:08           ` Oliver Freyermuth
  0 siblings, 0 replies; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-11  9:08 UTC (permalink / raw)
  To: Paul Robert Marino; +Cc: netfilter

[-- Attachment #1: Type: text/plain, Size: 21380 bytes --]

On 11.05.21 at 03:00, Paul Robert Marino wrote:
> Well, in the scenario where you don't control the upstream router I would recommend putting a small routing switch stack in the middle, the reason being that it solves a lot of the potential hardware issues around redundancy and load balancing. Ideally I always like to see a separate routing switch stack on both sides that can only be managed by an OOB network on dedicated ports.

We actually have a (non-redundant) routing switch on one end, and the mentioned routing stack on the other end, but _neither_ is controlled by us (we could take control of one,
but since the operators maintain the full switch infrastructure and are very cooperative and experienced, we preferred to leave this component to them as well for now).
So we'll definitely get in contact with the operators and try to convince them, so that we can avoid extra hardware (but since the infrastructure on one end is shared, this may need some discussion).

> Back when I did this stuff regularly at a large scale (managing hundreds of firewalls) I would use cheap Avaya (originally Nortel, now Extreme Networks) ERS or VSP switches in a stack for this, because they had the right features at a reasonable price, but any switches that do stacking or can do multiple 10Gbps uplinks with routing should do. That was what I always found to be the most stable configuration. There may also be advantages to using a switch stack from the same manufacturer as the upstream router. That also opens up the possibility of doing 100Gbps to the intermediate switches and doing a more
> traditional primary/backup configuration on the firewalls.

Thanks! We actually have a general contract to "prefer" components from one of the (expensive) manufacturers, which is good to get things more homogeneous at least.
Let's see how the discussion turns out :-).

Cheers and many thanks,
	Oliver

> 
> On Mon, May 10, 2021, 7:21 PM Oliver Freyermuth <freyermuth@physik.uni-bonn.de <mailto:freyermuth@physik.uni-bonn.de>> wrote:
> 
>     Also answering inline.
> 
>     On 11.05.21 at 00:55, Paul Robert Marino wrote:
>      > I'm adding replies to your replies inline below
>      >
>      > On Mon, May 10, 2021, 5:55 PM Oliver Freyermuth
>      > <freyermuth@physik.uni-bonn.de <mailto:freyermuth@physik.uni-bonn.de>> wrote:
>      >>
>      >> Hey Paul,
>      >>
>      >> many thanks for the detailed reply!
>      >> Some comments inline.
>      >>
>      >> On 10.05.21 at 18:57, Paul Robert Marino wrote:
>      >>> hey Oliver,
>      >>> I've done similar things over the years, a lot of fun lab experiments
>      >>> and found it really it comes down to a couple of things.
>      >>> I did some POC testing with contrackd and some experimental code
>      >>> around trunking across multiple firewalls with a sprinkling of
>      >>> virtualization.
>      >>> There were a few scenarios I tried, some involving OpenVSwitch
>      >>> (because I was experimenting with SPBM) and some not, with contrackd
>      >>> similarly configured.
>      >>> All the scenarios were interesting but they all had relatively rare on
>      >>> slow (<1Gbs) network issues that grew exponentially in frequency on
>      >>> higher speed networks (>1Gbs) with latency in contrackd syncing.
>      >>
>      >> Indeed, we'd strive for ~20 Gb/s in our case, so this experience surely is important to hear about.
>      >>
>      >>> What i found is the best scenario was to use Quagga for dynamic
>      >>> routing to load balance the traffic between the firewall IP's,
>      >>> keepalived to load handle IP failover, and contrackd (in a similar
>      >>> configuration to the one you described) to keep the states in sync
>      >>> there are a few pitfalls in going down this route caused by bad and or
>      >>> outdated documentation for both Quagga and keepalived. I'm also going
>      >>> to give you some recommendations about some hardware topology stuff
>      >>> you may not think about initially.
>      >>
>      >> I'm still a bit unsure if we are on the same page, but that may just be caused by my limited knowledge of Quagga.
>      >> To my understanding, Quagga uses e.g. OSPF and hence can, if the routes have the same path, load-balance.
>      >>
>      >> However, in our case, we'd want to go for active/active firewalls (which of course are also routers).
>      >> But that means we have internal machines on one side, which use a single default gateway (per VLAN),
>      >> then our active/active firewall, and then the outside world (actually a PtP connection to an upstream router).
>      >>
>      >> Can Quagga help me to actively use both firewalls in a load-balancing and redundant way?
>      >> The idea here is that the upstream router has high bandwidth, so using more than one firewall allows to achieve better throughput,
>      >> and with active/active we'd also strive for redundancy (i.e. reduced throughput if one firewall fails).
>      >> To my understanding, OSPF / Quagga could do this if the firewalls are placed between routers also joining via OSPF.
>      >> But is there also a way to have the clients directly talk to our firewalls, and the firewalls to a single upstream router (which we don't control)?
>      >>
>      >> A simple drawing may help:
>      >>
>      >>                ____  FW A ____
>      >>               /               \
>      >> Client(s) --                 --PtP-- upstream router
>      >>               \____  FW B ____/
>      >>
>      >> This is why I thought about using xt_cluster and giving both FW A and FW B the very same IP (the default gateway of the clients)
>      >> and the very same MAC at the same time, so the switch duplicates the packets, and then FW A accepts some packets and FW B the remaining ones
>      >> via filtering with xt_cluster.
>      >>
>      >> Can Quagga do something in this picture, or simplify this picture?
>      >> The upstream router also sends all incoming packets to a single IP in the PtP network, i.e. the firewall nodes need to show up as "one converged system"
>      >> to both the clients on one side and the upstream router on the other side.
>      >
>      >
>      > I understand what you are shooting for but it's dangerous at those
>      > data rates and not achievable via stock existing software.
>      > I did write some POC code years ago for a previous employer but
>      > determined it was too dangerous to put into production without some
>      > massive kernel changes such as using something like RDMA over
>      > dedicated high speed interfaces or linking the systems over the PCI
>      > express busses to sync the states instead of using contrackd.
>      >
>      > So load balancing is a better choice in this case, and many middle to
>      > higher end managed switches that have routers built in can do OSPF.
>      > I've seen many stackable switches that can do it. By the way Quagga
>      > supports several other dynamic routing protocols not just just OSPF.
> 
>     Thanks, now I understand your answer much better — the classical case of intention getting lost between the lines.
>     Indeed, this is important experience, many thanks for sharing it!
> 
>     I was already unsure if with such a solution I could really expect to achieve these data rates,
>     so this warning is worth its weight in gold.
>     I'll still play around with this setup in the lab, but testing at scale is also not easy
>     (for us) in the lab, so again this warning is very useful so we won't take this into production.
> 
>     The problem which made me think about all this is that we don't have control of the upstream router.
>     That made me hope for a solution which does not require changes on that end.
>     But of course we can communicate with the operators
>     and see if we can find a way to use dynamic routing on that end.
> 
>      > The safest and easiest option for you would be to use 100Gbs fibre
>      > connection instead, possibly with direct attach cables if you want to
>      > save on optics, and do primary secondary failover.
> 
>     Sadly, the infrastructure further upstream is not yet upgraded to support 100 Gb/s (and will not be in the near future),
>     otherwise, this surely would have been the easier option.
> 
>      >>> I will start with Quagga because the bad documentation part is easy to cover.
>      >>> in the Quagga documentation they recommend that you put a routable IP
>      >>> on a loopback interface and attach Quagga the daemon for the dynamic
>      >>> routing service of your choice to it, That works fine on BSD and old
>      >>> versions of Linux from 20 years ago but any thing running a Linux
>      >>> kernel version of 2.4 or higher will not allow it unless you change
>      >>> setting in /etc/sysctrl.conf and the Quagga documentation tells you to
>      >>> make those changes. DO NOT DO WHAT THEY SAY, its wrong and dangerous.
>      >>> Instead create a "dummy" interface with a routable IP for this
>      >>> purpose. a dummy interface is a special kind of interface meant for
>      >>> exactly the scenario described and works well without compromising the
>      >>> security of your firewall.
>      >>
>      >> Thanks for this helpful advice!
>      >> Even though I am not sure yet Quagga will help me out in this picture,
>      >> I am now already convinced we will have a situation in which Quagga will help us out.
>      >> So this is noted down for future use :-).
>      >>
>      >>> Keepalived
>      >>> the main error in keepalived's documentation is is most of the
>      >>> documentation and howto's you will find about it on the web are based
>      >>> on a 15 year old howto which had a fundamental mistake in how VRRP
>      >>> works, and what the "state"  flag actually does because its not
>      >>> explained well in the man file. "state" in a "vrrp_instance" should
>      >>> always be set to "MASTER" on all nodes and the priority should be used
>      >>> to determine which node should be the preferred master. the only time
>      >>> you should ever set state to "BACKUP" is if you have a 3rd machine
>      >>> that you never want to become the master which you are just using for
>      >>> quorum and in that case its priority should also be set to "0"
>      >>> (failed) . setting the state to "BACKUP" will seem to work fine until
>      >>> you have a failover event when the interface will continually go ip
>      >>> and done on the backup node. on the mac address issue keepalived will
>      >>> apr ping the subnets its attached to so that's generally not an issue
>      >>> but I would recommend using vmac's (virtual mac addresses) assuming
>      >>> the kernel for your distro and your network cards support it because
>      >>> that way it just looks to the switch like it changed a port due to
>      >>> some physical topology change and switches usually handle that very
>      >>> gracefully, but don't always handle the mac address change for IP
>      >>> addresses as quickly.
>      >>> I also recommend reading the RFC's on VRRP particularly the parts that
>      >>> explain how the elections and priorities work, they are a quick and
>      >>> easy read and will really give you a good idea of how to configure
>      >>> keepalived properly to achieve the failover and recovery behavior you
>      >>> want.
>      >>
>      >> See above on the virtual MACs — if the clients should use both firewalls at the same time,
>      >> I think I'd need a single MAC for both, so the clients only see a single default gateway.
>      >> In a more classic setup, we've used pcs (pacemaker and corosync) to successfully migrate virtual IPs and MAC addresses.
>      >> It has worked quite reliable (using Kronosnet for communication).
>      >> But we've also used Keepalived some years ago successfully :-).
>      >>
>      >>> On the hardware topology
>      >>> I recommend using dedicated interfaces for contrackd, really you don't
>      >>> need anything faster than 100Mbps even if the data interfaces are
>      >>> 100Gbps but i usually use 1 Gbps interfaces for this. they can be on
>      >>> their own dedicated switches or crossover interfaces. the main concern
>      >>> here is securely handling a large number of tiny packets so having
>      >>> dedicated network card buffers to handle microburst  is useful and if
>      >>> you can avoid latency from a switch that's trying to be too smart for
>      >>> its own good that's for the best.
>      >>
>      >> Indeed, we have 1 Gb/s crossover link, and use a 1 Gb/s connection through a switch in case this would ever fail for some reason —
>      >> we use these links both for conntrackd and for Kronosnet communication by corosync.
>      >>
>      >>> For keepalived use dedicated VLAN's on each physical interface to
>      >>> handle the heartbeats and group the VRRP interfaces. to insure the
>      >>> failovers of the IP's on both sides are handled correctly.
>      >>> If you only have 2 firewalls I recommend using a an additional device
>      >>> on each side for quorum in a backup/failed mode as described above.
>      >>> Assuming a 1 second or greater interval the device could be something
>      >>> as simple as a Raspberry PI it really doesn't need to be anything
>      >>> powerful because its just adding a heartbeat to the cluster, but for
>      >>> sub second intervals you may need something more powerful because sub
>      >>> second intervals can eat a surprising amount of CPU.
>      >>
>      >> We currently went without an external third party and let corosync/pacemaker use a STONITH device to explicitly kill the other node
>      >> and establish a defined state if heartbeats get lost. We might think about a third machine at some point to get an actual quorum, indeed.
>      >
>      >
>      > I get why you might think to use corosync/pacemaker for this if you
>      > weren't familiar with keepalived and LVS in the kernel,  but it's
>      > hammering a square peg in a round hole when you have a perfectly
>      > shaped and sized peg available to you that's actually been around a
>      > lot longer and works a lot more predictably, faster and more reliably
>      > by leveraging parts of the kernels network stack designed specifically
>      > for this use case. I've done explicit kills of the other device via
>      > cross connected hardware watchdog devices via keepalived before and it
>      > was easy.
>      > By the way if you don't know what LVS is it's the kernels builtin
>      > layer 3 network load balancer stack that was designed with these kind
>      > of failover scenarios in mind keepalived is just a wrapper around LVS
>      > that adds VRRP based heartbeating and hooks to allow you to call
>      > external scripts for actions based on heart beat state change events
>      > and additional watchdog scripts which can also trigger state changes.
>      > To be clear i wouldn't use keepalived to handle process master slave
>      > failovers i would use corosync and pacemaker, or in some cases
>      > Clusterd for that because they are usually the right tool for the job,
>      > but for firewall and or network load balancer failover i would always
>      > use keepalived because its the right tool for that job.
>      >
> 
>     Our main reasoning for corosync/pacemaker was that we've used it for the predecessor setup quite successfully for ~7 years,
>     while we have only used keepalived in smaller configurations (but it also served us well).
>     You raise many valid points, so even though pacemaker/corosync has not disappointed us (as of yet), we might indeed reconsider this decision.
> 
>     Cheers and thanks,
>              Oliver
> 
>      >
>      >>
>      >> Cheers and thanks again,
>      >>          Oliver
>      >>
>      >>>
>      >>>
>      >>> On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
>      >>> <freyermuth@physik.uni-bonn.de <mailto:freyermuth@physik.uni-bonn.de>> wrote:
>      >>>>
>      >>>> Dear netfilter experts,
>      >>>>
>      >>>> we are trying to setup an active/active firewall, making use of "xt_cluster".
>      >>>> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional ARPtables tricks.
>      >>>>
>      >>>> So we set rules like:
>      >>>>
>      >>>>     iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
>      >>>>     iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
>      >>>>
>      >>>> Ideally, it we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
>      >>>>
>      >>>> Basic tests show that this works as expected, but the details get messy.
>      >>>>
>      >>>> 1. Certainly, conntrackd is needed to synchronize connection states.
>      >>>>       But is it always "fast enough"?
>      >>>>       xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
>      >>>>       but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
>      >>>>       We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additonally set "PollSecs 15" since without that it seems
>      >>>>       only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling.
>      >>>>       Maybe another way which e.g. may use XOR(src,dst) might work around tight synchronization requirements, or is it possible to always uses the "internal" source IP?
>      >>>>       Is anybody doing that with a custom BPF?
>      >>>>
>      >>>> 2. How to do failover in such cases?
>      >>>>       For failover we'd need to change these rules (if one node fails, the total-nodes will change).
>      >>>>       As an alternative, I found [1] which states multiple rules can be used and enabled / disabled,
>      >>>>       but does somebody know of a cleaner (and easier to read) way, also not costing extra performance?
>      >>>>
>      >>>> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
>      >>>>       so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
>      >>>>
>      >>>> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
>      >>>>       are rather used to migrate services, but not to communicate node ids and number of total active nodes.
>      >>>>       They can probably be tricked into doing that somehow, but they are not designed this way.
>      >>>>       TIPC may be something to use here, but I found nothing "ready to use".
>      >>>>
>      >>>> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
>      >>>> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
>      >>>>
>      >>>> Cheers and thanks in advance,
>      >>>>           Oliver
>      >>>>
>      >>>> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
>      >>>>
>      >>>> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19 <https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19>
>      >>>> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/ <https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@netfilter.org/>
>      >>>>
>      >>>> --
>      >>>> Oliver Freyermuth
>      >>>> Universität Bonn
>      >>>> Physikalisches Institut, Raum 1.047
>      >>>> Nußallee 12
>      >>>> 53115 Bonn
>      >>>> --
>      >>>> Tel.: +49 228 73 2367
>      >>>> Fax:  +49 228 73 7869
>      >>>> --
>      >>>>
>      >>
>      >>
>      >> --
>      >> Oliver Freyermuth
>      >> Universität Bonn
>      >> Physikalisches Institut, Raum 1.047
>      >> Nußallee 12
>      >> 53115 Bonn
>      >> --
>      >> Tel.: +49 228 73 2367
>      >> Fax:  +49 228 73 7869
>      >> --
>      >>
> 
> 
>     -- 
>     Oliver Freyermuth
>     Universität Bonn
>     Physikalisches Institut, Raum 1.047
>     Nußallee 12
>     53115 Bonn
>     --
>     Tel.: +49 228 73 2367
>     Fax:  +49 228 73 7869
>     --
> 


-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-10 22:58   ` Oliver Freyermuth
@ 2021-05-11  9:28     ` Oliver Freyermuth
  2021-05-11 12:24       ` Pablo Neira Ayuso
  0 siblings, 1 reply; 11+ messages in thread
From: Oliver Freyermuth @ 2021-05-11  9:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter

[-- Attachment #1: Type: text/plain, Size: 2084 bytes --]

Hi Pablo,

a short additional question after considering this for a while longer:

On 11.05.21 at 00:58, Oliver Freyermuth wrote:
>>> [...]
>>> Basic tests show that this works as expected, but the details get messy.
>>>
>>> 1. Certainly, conntrackd is needed to synchronize connection states.
>>>     But is it always "fast enough"?  xt_cluster seems to match by the
>>>     src_ip of the original direction of the flow[0] (if I read the code
>>>     correctly), but what happens if the reply to an outgoing packet
>>>     arrives at both firewalls before state is synchronized?
>>
>> You can avoid this by setting DisableExternalCache to off. Then, in
>> case one of your firewall node goes off, update the cluster rules and
>> inject the entries (via keepalived, or your HA daemon of choice).
>>
>> Recommended configuration is DisableExternalCache off and properly
>> configure your HA daemon to assist conntrackd. Then, the conntrack
>> entries in the "external cache" of conntrackd are added to the kernel
>> when needed.
> 
> You caused a classic "facepalming" moment. Of course, that will solve (1)
> completely. My initial thinking when disabling the external cache
> was before I understood how xt_cluster works, and before I found that it uses the direction
> of the flow, and then it just escaped my mind.
> Thanks for clearing this up! :-)

Thinking about this, the conntrack synchronization requirements would essentially be "zero",
since after a flow is established, it stays on the same machine, and conntrackd synchronization is only relevant
on failover — right?
So this approach would not limit / reduce the achievable bandwidth, since the only ingredient are the mangling filters —
so in case we can't go for dynamic routing with Quagga and hardware router stacks, this could even be a solution
for high bandwidths?

Cheers and thanks,
	Oliver

-- 
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--


[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5432 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-11  9:28     ` Oliver Freyermuth
@ 2021-05-11 12:24       ` Pablo Neira Ayuso
  2021-05-11 21:37         ` Paul Robert Marino
  0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2021-05-11 12:24 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: netfilter

Hi Oliver,

On Tue, May 11, 2021 at 11:28:23AM +0200, Oliver Freyermuth wrote:
> Hi Pablo,
> 
> a short additional question after considering this for a while longer:
> 
> > On 11.05.21 at 00:58, Oliver Freyermuth wrote:
> > > > [...]
> > > > Basic tests show that this works as expected, but the details get messy.
> > > > 
> > > > 1. Certainly, conntrackd is needed to synchronize connection states.
> > > >     But is it always "fast enough"?  xt_cluster seems to match by the
> > > >     src_ip of the original direction of the flow[0] (if I read the code
> > > >     correctly), but what happens if the reply to an outgoing packet
> > > >     arrives at both firewalls before state is synchronized?
> > > 
> > > You can avoid this by setting DisableExternalCache to off. Then, in
> > > case one of your firewall node goes off, update the cluster rules and
> > > inject the entries (via keepalived, or your HA daemon of choice).
> > > 
> > > Recommended configuration is DisableExternalCache off and properly
> > > configure your HA daemon to assist conntrackd. Then, the conntrack
> > > entries in the "external cache" of conntrackd are added to the kernel
> > > when needed.
> > 
> > You caused a classic "facepalming" moment. Of course, that will solve (1)
> > completely. My initial thinking when disabling the external cache
> > was before I understood how xt_cluster works, and before I found that it uses the direction
> > of the flow, and then it just escaped my mind.
> > Thanks for clearing this up! :-)
> 
> Thinking about this, the conntrack synchronization requirements
> would essentially be "zero", since after a flow is established, it
> stays on the same machine, and conntrackd synchronization is only
> relevant on failover — right?

Well, you have to preventively synchronize states because you do not
know when your router will become unavailable, so one of the routers
in your pool takes over flows, right? So it depends on whether there
are HA requirements on your side for the existing flows.

> So this approach would not limit / reduce the achievable bandwidth,
> since the only ingredient are the mangling filters — so in case we
> can't go for dynamic routing with Quagga and hardware router stacks,
> this could even be a solution for high bandwidths?

I think so, yes. However, note that you're spending cycles to drop
packets that your node does not own though.

In case you have HA requirements, there is a number of trade-offs you
can apply to reduce the synchronization workload, for example, only
synchronize TCP established connections to reduce the amount of
messages between the two routers. There is also tuning that you could
explore: You could play with affinity to pin conntrackd into a CPU
core which is *not* used to handle NIC interruptions. IIRC, there is
-j CT action in iptables that allows to filter the netlink events that
are sent to userspace conntrackd (e.g. you could just send events for
"ct status assured" flows to userspace).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Running an active/active firewall/router (xt_cluster?)
  2021-05-11 12:24       ` Pablo Neira Ayuso
@ 2021-05-11 21:37         ` Paul Robert Marino
  0 siblings, 0 replies; 11+ messages in thread
From: Paul Robert Marino @ 2021-05-11 21:37 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Oliver Freyermuth, netfilter

Hey Oliver,
That is exactly right, and the suggestion of using a switch stack
is also a redundancy thing: if one switch has a hardware
failure, the other switch will still route data.
There is the additional possibility of trunking 2 interfaces across
the two switches (1 to each) in the stack, which means that if one of your
firewalls fails over, 1 firewall could handle the full 20Gbps of traffic
across the 2 10Gbps interfaces.

Also, on a side note: at the time we had chosen Avaya primarily for
latency reasons, not price, but when dealing with over 100 firewalls in
a mission-critical environment the price was nice for the budget. Also,
being one of their biggest clients at the time, we had leverage with
their dev team to get them to prioritize fixing our issues :). And they
are legitimately good switches that are underrated, with some cool
features; their shortest path bridging stuff is really awesome for
large-scale networks.
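
(For the cross-switch trunk on the firewall side, a minimal sketch with
iproute2, assuming an LACP-capable switch stack and hypothetical data
interfaces ens1f0/ens1f1, one cabled to each switch in the stack:

	ip link add bond0 type bond mode 802.3ad lacp_rate fast
	ip link set ens1f0 down && ip link set ens1f0 master bond0
	ip link set ens1f1 down && ip link set ens1f1 master bond0
	ip link set bond0 up

so that either switch can fail without taking the trunk down.)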

On Tue, May 11, 2021 at 8:25 AM Pablo Neira Ayuso <pablo@netfilter.org> wrote:
>
> Hi Oliver,
>
> On Tue, May 11, 2021 at 11:28:23AM +0200, Oliver Freyermuth wrote:
> > Hi Pablo,
> >
> > a short additional question after considering this for a while longer:
> >
> > Am 11.05.21 um 00:58 schrieb Oliver Freyermuth:
> > > > > [...]
> > > > > Basic tests show that this works as expected, but the details get messy.
> > > > >
> > > > > 1. Certainly, conntrackd is needed to synchronize connection states.
> > > > >     But is it always "fast enough"?  xt_cluster seems to match by the
> > > > >     src_ip of the original direction of the flow[0] (if I read the code
> > > > >     correctly), but what happens if the reply to an outgoing packet
> > > > >     arrives at both firewalls before state is synchronized?
> > > >
> > > > You can avoid this by setting DisableExternalCache to off. Then, in
> > > > case one of your firewall node goes off, update the cluster rules and
> > > > inject the entries (via keepalived, or your HA daemon of choice).
> > > >
> > > > Recommended configuration is DisableExternalCache off and properly
> > > > configure your HA daemon to assist conntrackd. Then, the conntrack
> > > > entries in the "external cache" of conntrackd are added to the kernel
> > > > when needed.
> > >
> > > You caused a classic "facepalming" moment. Of course, that will solve (1)
> > > completely. My initial thinking when disabling the external cache
> > > was before I understood how xt_cluster works, and before I found that it uses the direction
> > > of the flow, and then it just escaped my mind.
> > > Thanks for clearing this up! :-)
> >
> > Thinking about this, the conntrack synchronization requirements
> > would essentially be "zero", since after a flow is established, it
> > stays on the same machine, and conntrackd synchronization is only
> > relevant on failover — right?
>
> Well, you have to preventively synchronize states because you do not
> know when your router will become unavailable, so one of the routers
> in your pool takes over flows, right? So it depends on whether there
> are HA requirements on your side for the existing flows.
>
> > So this approach would not limit / reduce the achievable bandwidth,
> > since the only ingredient are the mangling filters — so in case we
> > can't go for dynamic routing with Quagga and hardware router stacks,
> > this could even be a solution for high bandwidths?
>
> I think so, yes. However, note that you're spending cycles to drop
> packets that your node does not own though.
>
> In case you have HA requirements, there is a number of trade-offs you
> can apply to reduce the synchronization workload, for example, only
> synchronize TCP established connections to reduce the amount of
> messages between the two routers. There is also tuning that you could
> explore: You could play with affinity to pin conntrackd into a CPU
> core which is *not* used to handle NIC interruptions. IIRC, there is
> -j CT action in iptables that allows to filter the netlink events that
> are sent to userspace conntrackd (e.g. you could just send events for
> "ct status assured" flows to userspace).

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2021-05-11 21:37 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-09 17:52 Running an active/active firewall/router (xt_cluster?) Oliver Freyermuth
2021-05-10 16:57 ` Paul Robert Marino
2021-05-10 21:55   ` Oliver Freyermuth
2021-05-10 22:55     ` Paul Robert Marino
2021-05-10 23:21       ` Oliver Freyermuth
     [not found]         ` <CAPJdpdDNmTq_yafDU12w1xz7PUTm4zZr6vt2nGciv=baGYwP1A@mail.gmail.com>
2021-05-11  9:08           ` Oliver Freyermuth
2021-05-10 22:19 ` Pablo Neira Ayuso
2021-05-10 22:58   ` Oliver Freyermuth
2021-05-11  9:28     ` Oliver Freyermuth
2021-05-11 12:24       ` Pablo Neira Ayuso
2021-05-11 21:37         ` Paul Robert Marino
