* iptables at scale
@ 2015-03-11 19:10 Glen Miner
From: Glen Miner @ 2015-03-11 19:10 UTC (permalink / raw)
  To: netfilter-devel

We built a proxy system on top of netfilter for our online game; it works pretty well but we've run into some problems at scale that I thought you might be interested in. I'm also keenly interested in any feedback or suggestions that you may have for scaling higher because the markets we're moving into need a lot more proxies than iptables can handle right now.

What it does: makes UDP NAT proxies so people with strict-NAT or multi-layer NAT can communicate. This essentially works out to be 4 iptables rules created for a pair of peers. They talk to our server and our server makes it look like they're talking to each other; fighting bad NAT with our NAT is somewhat ironic, but it's quite elegant.

The good: the system load scales *very* well -- we have it servicing thousands of players and 40+ Mbit/s of game traffic on a single node right now, and the CPU is 95% idle.

The bad: under load we're creating or deleting rules 10-20 times per second, and we want to scale that much higher.

Our initial implementation shelled out to sudo iptables; however, this had considerable overhead. We were able to cut that time in half by converting the 4 iptables calls into a single iptables-restore call with --noflush. We then rewrote our server to run as root and use the native iptc APIs (knowing full well we're at the mercy of things changing), and this made things about 20x faster.

The ugly: as the number of iptables rules increases the time required to modify the table grows. At first this is linear but it starts to go super-linear after about 16,000 rules or so; I assume it's blowing out a CPU cache.

Here are some real-world numbers for creation time (again, note 1 proxy = 4 iptables rules):

The first 100 proxies take under 1ms each
At 750 proxies we're seeing them take 10ms each
At 4000 proxies we're at 70ms each
At 5000 proxies we're at 100-160ms each (it's erratic)

I can post a graph somewhere if people want to see it.

I did a bit of timing; for the 1st proxy created it's very fast:

104us iptc_init
19us  2x iptc_insert_entry and 2x iptc_append_entry
50us iptc_commit
65us iptc_free

At the 5000 proxy / 20,000 rule mark the timings are nearly 1000x longer; note timings are in milliseconds instead of microseconds:

38 ms for iptc_init("nat");
0.05 ms for 2x iptc_insert_entry and 2x iptc_append_entry
72 ms for iptc_commit
5.5 ms for iptc_free

My test machine is an old Intel(R) Pentium(R) D CPU at 2.66GHz (obviously we use big iron for production servers), and I've observed the same scaling problems in production.

At this point I'm getting desperate and questioning my sanity -- looking at the iptc interfaces I just don't see how I could improve things; even the allocator overhead for 20,000 rules is painful. The best I can do is a whole lot of async back-flips: batch up the operations that accumulated while we were grinding in iptc_*, then try to flush them through in a single init/commit. I might be eating over 100ms of latency per batch, but I might be able to sustain a higher throughput.
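
To make that concrete, the pattern I have in mind is roughly the following untested sketch -- build_proxy_entry() is a hypothetical stand-in for the lengthy struct ipt_entry construction; the iptc_* calls are the real libiptc API I'm already using:

/* Untested sketch of the batching idea: one iptc_init/iptc_commit
 * round-trip amortized over every proxy queued while the previous
 * batch was grinding. build_proxy_entry() is hypothetical. */
#include <stdlib.h>
#include <libiptc/libiptc.h>

/* Hypothetical helper: builds the ipt_entry (match + DNAT/SNAT target)
 * for rule r (0-3) of one proxy. */
extern struct ipt_entry *build_proxy_entry(int r, int proxy_id);

int flush_batch(const int *proxy_ids, int n)
{
    struct xtc_handle *h = iptc_init("nat");  /* snapshots the whole table once */
    if (h == NULL)
        return -1;

    for (int i = 0; i < n; i++) {
        for (int r = 0; r < 4; r++) {
            struct ipt_entry *e = build_proxy_entry(r, proxy_ids[i]);

            if (r < 2)  /* the two DNAT rules, with -I semantics */
                iptc_insert_entry("PREROUTING", e, 0, h);
            else        /* the two SNAT rules, with -A semantics */
                iptc_append_entry("POSTROUTING", e, h);
            free(e);    /* libiptc copies the entry */
        }
    }

    int ok = iptc_commit(h);  /* one table write-back for the whole batch */
    iptc_free(h);
    return ok ? 0 : -1;
}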

Will nftables scale any better? I'm not sure how much headache it would be to ride the bleeding edge, but if that's what it takes I'll do it.

Is there any way to shard the tables so that I can operate on smaller slices? I'm sure the answer here is 'no.'

I haven't looked at libiptc's internals -- I assume the problem is the current pattern of 'get it all, modify it, put it all back.' I'm guessing that since nobody has yet made it support incremental changes, this is probably not easy.

Thoughts, suggestions and criticism welcome.

-g





* Re: iptables at scale
From: Jan Engelhardt @ 2015-03-11 20:15 UTC (permalink / raw)
  To: Glen Miner; +Cc: netfilter-devel


On Wednesday 2015-03-11 20:10, Glen Miner wrote:
>
>The ugly: as the number of iptables rules increases the time required to
>modify the table grows. At first this is linear but it starts to go
>super-linear after about 16,000 rules or so;  I assume it's blowing out a CPU
>cache.

Quite possibly. And even if one operating system's implementation can
do 16000, and the other 20000 before the turning point is reached, be
aware that the limiting factor is generally capacity. The problem is
not specific to networking. The same can be observed when there is
not enough RAM for an arbitrary task and pages get swapped out to
disk. To date, the only solution was to "add more capacity" or to
reduce/split the workload, as there is only so much vertical scaling
one can do.


>Will nftables scale any better?

iptc: Processes entire rulesets. Many non-present options
take up space anyway (as "::/0"). Sent to kernel as-is as one huge
chunk. "One huge" allocation (per NUMA node) to copy it into the
kernel. Not a lot of parsing needed. Ruleset is linear.

nft: Single rules can be updated. The \0 space wastage is elided.
Sent to kernel in Netlink packets (max. 64K by now, I think), so
there is a back-and-forth of syscalls. Kernel needs to deserialize.
(Last time I checked, rules are held in a linked list like in BSD
*pf, therefore many small allocations all over the place.)

The word on the street is that nft's expressiveness allows you to
have fewer rules. Whether it can really be exploited to that level
ultimately depends on the ruleset and how many conditions you
can aggregate, innit.

In other words, if you really want to know, measure.
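
The iptc side is cheap to measure; a minimal, untested harness along
these lines (link with -lip4tc, run as root), with the nft equivalent
left as the actual experiment:

/* Untested sketch: time the iptc round-trip with clock_gettime. */
#include <stdio.h>
#include <time.h>
#include <libiptc/libiptc.h>

static double ms_since(struct timespec t0)
{
    struct timespec t1;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec t0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    struct xtc_handle *h = iptc_init("nat");
    printf("iptc_init:   %.3f ms\n", ms_since(t0));
    if (h == NULL)
        return 1;

    /* insert/append test entries here */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    iptc_commit(h);
    printf("iptc_commit: %.3f ms\n", ms_since(t0));
    iptc_free(h);
    return 0;
}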


* RE: iptables at scale
From: Glen Miner @ 2015-03-11 20:30 UTC (permalink / raw)
  To: netfilter-devel

>>The ugly: as the number of iptables rules increases the time required to
>>modify the table grows. At first this is linear but it starts to go
>>super-linear after about 16,000 rules or so; I assume it's blowing out a CPU
>>cache.
>
> Quite possibly. And even if one operating system's implementation can
> do 16000, and the other 20000 before the turning point is reached, be
> aware that the limiting factor is generally capacity. The problem is
> not specific to networking. The same can be observed when there is
> not enough RAM for an arbitrary task and pages get swapped out to
> disk. To date, the only solution was to "add more capacity" or to
> reduce/split the workload, as there is only so much vertical scaling
> one can do.

To be clear: judging by the scaling of kernel performance thus far, the only bottleneck at this point is table modification. I've easily got 20x headroom on the server as it is, but I'm choked because I can't create rules fast enough.

And yes -- we're ready to go wide -- but that has its own problems. I'd really like to find a way to get a lot better utilization per node. 

> nft: Single rules can be updated. The \0 space wastage is elided.
...
> In other words, if you really want to know, measure.

Ok, thanks for the info -- I don't know enough about the state this package is in, but I'll take a look. I'm not sure I can dump iptables out of my current system and drop in nftables instead, but maybe I'll try tomorrow. If I get nftables working I'll definitely post scaling numbers.

-g


* RE: iptables at scale
From: Glen Miner @ 2015-03-11 21:27 UTC (permalink / raw)
  To: netfilter-devel


>> What it does: makes UDP NAT proxies so people with strict-NAT or 
>> multi-layer NAT can communicate. This essentially works out to be 4 
>> iptables rules created for a pair of peers. They talk to our server and 
>> our server makes it look like they're talking to each other; fighting 
>> bad NAT with our NAT is somewhat ironic, but it's quite elegant. 
> 
> What exactly are these four rules, and what do they accomplish? 

aAddr : aPort is player A's real ip : port
bAddr : bPort is player B's real ip : port
nAddr : anPort is what player B uses to talk to A
nAddr : bnPort is what player A uses to talk to B

echo "*nat
:PREROUTING - [0:0]
-I PREROUTING -p udp -s $aAddr --sport $aPort -d $nAddr --dport $bnPort -j DNAT --to $bAddr:$bPort
-I PREROUTING -p udp -s $bAddr --sport $bPort -d $nAddr --dport $anPort -j DNAT --to $aAddr:$aPort
:POSTROUTING - [0:0]
-A POSTROUTING -p udp -s $aAddr --sport $aPort -d $bAddr --dport $bPort  -j SNAT --to $nAddr:$anPort
-A POSTROUTING -p udp -s $bAddr --sport $bPort -d $aAddr --dport $aPort  -j SNAT --to $nAddr:$bnPort
COMMIT
" | iptables-restore --noflush

> You won't be happy changing iptables rules on the fly for each 
> customer. Hire somebody who can program in C, point them at the 
> iptables kernel NAT module implementation, and maybe ipset, and get 
> them hacking away at a solution that works without table manipulation. 

I can program in C; like I said: I translated the above command + some conntrack junk to direct function calls to skip process creation and parsing logic for 20x better performance. However, it's not really clear where I'd go instead short of reinventing iptables / conntrack and lord knows what else. Are you suggesting I write my own kernel hooks?

-g


* RE: iptables at scale
From: Jan Engelhardt @ 2015-03-11 21:44 UTC (permalink / raw)
  To: Glen Miner; +Cc: netfilter-devel


On Wednesday 2015-03-11 22:27, Glen Miner wrote:
>> 
>> What exactly are these four rules, and what do they accomplish? 
>
>echo "*nat
>:PREROUTING - [0:0]
>-I PREROUTING -p udp -s $aAddr --sport $aPort -d $nAddr --dport $bnPort -j DNAT --to $bAddr:$bPort
>-I PREROUTING -p udp -s $bAddr --sport $bPort -d $nAddr --dport $anPort -j DNAT --to $aAddr:$aPort
>:POSTROUTING - [0:0]
>-A POSTROUTING -p udp -s $aAddr --sport $aPort -d $bAddr --dport $bPort  -j SNAT --to $nAddr:$anPort
>-A POSTROUTING -p udp -s $bAddr --sport $bPort -d $aAddr --dport $aPort  -j SNAT --to $nAddr:$bnPort
>COMMIT
>" | iptables-restore --noflush
>
>> You won't be happy changing iptables rules on the fly for each 
>> customer. Hire somebody who can program in C, point them at the 
>> iptables kernel NAT module implementation, and maybe ipset, and get 
>> them hacking away at a solution that works without table manipulation. 
>
>I can program in C; like I said: I translated the above command + some
>conntrack junk to direct function calls to skip process creation and parsing
>logic for 20x better performance. However, it's not really clear where I'd go
>instead short of reinventing iptables / conntrack and lord knows what else.

If all you do is the NAT mappings, then directly using conntrack(8)
and/or libnetfilter_conntrack should suffice, especially since UDP CT
entries stay around until their known timeout.
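
Roughly along these lines with libnetfilter_conntrack -- an untested
sketch for one direction of the A<->B mapping, with placeholder
addresses, where the reply tuple already encodes the DNAT+SNAT:

/* Rough, untested sketch: inject the A->B flow as a conntrack entry
 * whose reply tuple encodes the rewritten endpoints, via
 * libnetfilter_conntrack. Addresses and ports are placeholders. */
#include <stdio.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <libnetfilter_conntrack/libnetfilter_conntrack.h>

int main(void)
{
    struct nf_conntrack *ct = nfct_new();
    struct nfct_handle *h;

    if (ct == NULL)
        return 1;

    nfct_set_attr_u8(ct, ATTR_L3PROTO, AF_INET);
    nfct_set_attr_u8(ct, ATTR_L4PROTO, IPPROTO_UDP);

    /* Original direction: A -> proxy (nAddr:bnPort). */
    nfct_set_attr_u32(ct, ATTR_IPV4_SRC, inet_addr("192.0.2.10"));    /* aAddr */
    nfct_set_attr_u16(ct, ATTR_PORT_SRC, htons(40000));               /* aPort */
    nfct_set_attr_u32(ct, ATTR_IPV4_DST, inet_addr("198.51.100.1"));  /* nAddr */
    nfct_set_attr_u16(ct, ATTR_PORT_DST, htons(50001));               /* bnPort */

    /* Reply direction, post-NAT: B's real endpoint -> proxy (nAddr:anPort). */
    nfct_set_attr_u32(ct, ATTR_REPL_IPV4_SRC, inet_addr("203.0.113.7"));  /* bAddr */
    nfct_set_attr_u16(ct, ATTR_REPL_PORT_SRC, htons(40001));              /* bPort */
    nfct_set_attr_u32(ct, ATTR_REPL_IPV4_DST, inet_addr("198.51.100.1")); /* nAddr */
    nfct_set_attr_u16(ct, ATTR_REPL_PORT_DST, htons(50000));              /* anPort */

    nfct_set_attr_u32(ct, ATTR_TIMEOUT, 120);

    h = nfct_open(CONNTRACK, 0);
    if (h == NULL) {
        nfct_destroy(ct);
        return 1;
    }
    if (nfct_query(h, NFCT_Q_CREATE, ct) < 0)
        perror("nfct_query(NFCT_Q_CREATE)");
    nfct_close(h);
    nfct_destroy(ct);
    return 0;
}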


* RE: iptables at scale
From: Glen Miner @ 2015-03-11 21:54 UTC (permalink / raw)
  To: netfilter-devel

> If all you do is the NAT mappings, then directly using conntrack(8)
> and/or libnetfilter_conntrack should suffice, especially since UDP CT
> entries stay around until their known timeout.

Hmm. That's basically what you said back in January:

List: netfilter
Subject: Re: Stateless NAT with iptables
From: Jan Engelhardt
Date: 2015-01-09 23:54:16

http://marc.info/?l=netfilter&m=142084805119379&w=2

But I tried that and couldn't get it working; Marcelo Ricardo Leitner said:

"Having the conntrack entry is not enough to get your packets NATed"

List: netfilter
Subject: Re: Stateless NAT with iptables
From: Marcelo Ricardo Leitner <marcelo.leitner () gmail ! com>
Date: 2015-01-12 22:06:31
Message-ID: 54B44567.2050707 () gmail ! com

Which seemed to match my observations. I'll go back in the hole if people here think it's viable and I'm missing something, though.

Thanks for reading!
-g


* Re: iptables at scale
From: Jesper Dangaard Brouer @ 2015-03-16 20:52 UTC (permalink / raw)
  To: Glen Miner; +Cc: brouer, netfilter-devel

On Wed, 11 Mar 2015 17:27:35 -0400
Glen Miner <shaggie76@hotmail.com> wrote:

> >> What it does: makes UDP NAT proxies so people with strict-NAT or 
> >> multi-layer NAT can communicate. This essentially works out to be 4 
> >> iptables rules created for a pair of peers. They talk to our server and 
> >> our server makes it look like they're talking to each other; fighting 
> >> bad NAT with our NAT is somewhat ironic, but it's quite elegant. 
> > 
> > What exactly are these four rules, and what do they accomplish? 
> 
> aAddr : aPort is player A's real ip : port
> bAddr : bPort is player B's real ip : port
> nAddr : anPort is what player B uses to talk to A
> nAddr : bnPort is what player A uses to talk to B
> 
> echo "*nat
> :PREROUTING - [0:0]
> -I PREROUTING -p udp -s $aAddr --sport $aPort -d $nAddr --dport $bnPort -j DNAT --to $bAddr:$bPort
> -I PREROUTING -p udp -s $bAddr --sport $bPort -d $nAddr --dport $anPort -j DNAT --to $aAddr:$aPort
> :POSTROUTING - [0:0]
> -A POSTROUTING -p udp -s $aAddr --sport $aPort -d $bAddr --dport $bPort  -j SNAT --to $nAddr:$anPort
> -A POSTROUTING -p udp -s $bAddr --sport $bPort -d $aAddr --dport $aPort  -j SNAT --to $nAddr:$bnPort
> COMMIT
> " | iptables-restore --noflush

It looks like you will have a very long list of rules in PREROUTING and
POSTROUTING.  You will likely have scalability issues due to this, as
iptables processes these rules linearly.  Thus, as you add more rules,
the overhead per packet increases.  You are basically shooting yourself
in the foot by building your ruleset like this.

I solved the iptables update speed problem back in 2007-2008, and also
had a method for building a subnet-skeleton search tree with iptables
chains, which reduced the lookup complexity from O(n) to O(log n).

See my user-presentation:
 http://people.netfilter.org/hawk/presentations/nfws2008/nfws2008_userday_iptables_scale.pdf

See my developer presentation:
 http://people.netfilter.org/hawk/presentations/nfws2008/nfws2008_developers_iptables_scale.pdf
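
In iptables terms, the skeleton is just nested chains keyed on address
prefixes; an illustrative fragment only (placeholder prefixes, reusing
your $-variables):

# Illustrative only: split PREROUTING by source /16, then /24, so a
# packet traverses a few jumps instead of one long linear chain.
iptables -t nat -N PRE-10.1
iptables -t nat -N PRE-10.1.2
iptables -t nat -A PREROUTING -s 10.1.0.0/16 -j PRE-10.1
iptables -t nat -A PRE-10.1 -s 10.1.2.0/24 -j PRE-10.1.2
# the per-proxy rules then land in small leaf chains:
iptables -t nat -A PRE-10.1.2 -p udp -s $aAddr --sport $aPort -d $nAddr --dport $bnPort -j DNAT --to $bAddr:$bPort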

I deliberately "only" fixed the iptables update speed for
chain lookups.  I left in the scalability problem of updating a very
long list of rules in a single chain, because if someone were to do that
(like you) they should experience this slowdown, and hopefully stop to
think about whether they are doing something wrong, like you ;-).  (As
you are only inserting or appending rules, not adding or deleting rules
in the middle of the chain, you are not hitting the really bad parts.)


> > You won't be happy changing iptables rules on the fly for each 
> > customer. Hire somebody who can program in C, point them at the 
> > iptables kernel NAT module implementation, and maybe ipset, and get 
> > them hacking away at a solution that works without table manipulation. 
> 
> I can program in C; like I said: I translated the above command +
> some conntrack junk to direct function calls to skip process creation
> and parsing logic for 20x better performance. However, it's not
> really clear where I'd go instead short of reinventing iptables /
> conntrack and lord knows what else. Are you suggesting I write my own
> kernel hooks?

With my experience today, I would probably have written a simple
netfilter/iptables module to do the work, with a hash table for
matching my rules, externally configurable via netlink messages
(to add and remove customers).  (Which is basically what Jan said
about hiring somebody with experience.)

Today, we have nftables.  I'm not very experienced with nftables, but I
think that your problem might be solvable by using the nftables "set"
functionality, reducing your entire ruleset to 4 nftables "set" rules.
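
Something like the following untested sketch -- the map types and
NAT-map syntax are from current nft documentation that I have not
verified against your version, and the addresses/ports are placeholders:

# Untested sketch: per-proxy state lives in maps; the ruleset itself
# stays at two NAT rules regardless of proxy count.
table ip nat {
    map proxy_dnat {
        # proxy port on nAddr -> peer's real addr . port
        type inet_service : ipv4_addr . inet_service
    }
    map proxy_snat {
        # post-DNAT 4-tuple -> proxy addr . port to source-NAT to
        type ipv4_addr . inet_service . ipv4_addr . inet_service : ipv4_addr . inet_service
    }
    chain prerouting {
        type nat hook prerouting priority -100;
        dnat ip addr . port to udp dport map @proxy_dnat
    }
    chain postrouting {
        type nat hook postrouting priority 100;
        snat ip addr . port to ip saddr . udp sport . ip daddr . udp dport map @proxy_snat
    }
}

Adding or removing a proxy then becomes map-element updates instead of
rule insertions, e.g.:

nft add element ip nat proxy_dnat { 50001 : 203.0.113.7 . 40001 }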

A third option is using conntrack-tools [1], as Jan also mentioned.
[1] http://netfilter.org/projects/conntrack-tools/index.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

