All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] connlimit for nftables and limiting ct count by zone
@ 2017-10-16 14:42 Florian Westphal
  2017-10-27  0:49 ` Yi-Hung Wei
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Westphal @ 2017-10-16 14:42 UTC (permalink / raw)
  To: netfilter-devel; +Cc: YiHung Wei, Justin Pettit, Joe Stringer

During NFWS we briefly discussed iptables '-m connlimit' and how
to apply this to nftables.

There is also a use case to make nf_conntrack_max more fine-grained
by making this setting apply per conntrack zone.

I'd like to ideally resolve both with a single solution.

>From nft (user front end) this would look like this:

add rule inet filter input ct state new ct count { ip saddr } gt 100 drop

The part enclosed in { } denotes the key/grouping that should be applied.

e.g.

ct count { ip saddr & 255.255.255.0 }   # count by source network
ct count { ip saddr & 255.255.255.0 . tcp dport }   # count by source network and service
ct count { ct zone } # count by zone

For this to work there are several issues that need to be resolved.

1. xt_connlimit must be split into an iptables part and a 'nf_connlimit'
   backend.

   nf_connlimit.c would implement the main function:

unsigned int nf_connlimit_count(struct nf_connlimit_data *,
				const struct nf_conn *conn,
				const void *key,
				u16 keylen);

Where 'nf_connlimit_data' is a structure that contains the (internal)
bookkeeping structure(s), conn is the connection that is to be counted,
and key/keylen is the (arbitrary) identifier to be used for grouping
connections.

xt_connlimits match function would then build a 'u32 key[5]' based on
the options the user provided on the iptables command line, i.e.
the conntrack zone and then either a source or destination network
(or address).

2. nftables can add a very small function to nft_ct.c expression that
hands a source register off as *key, and places result (the number of
connections) into a destination register.

In the iptables/nftables case, the struct nf_connlimit_data * would be
attached to the match/expression, i.e. there can be multiple such
count-functions at the same time.

3. Other users, such as ovs, could also call this api, in the case of
per-zone limiting key would simply be a u16 containing the zone identifier.

However, 2 and 3 make further internal changes necessary.

Right now, connlimit performs garbage collection from the packet path.
This isn't a big deal now, as we limit based by single ip address or network
only, the limit will be small.

bookkeeping looks like this:

      Node
      / \
   Node  Node  -> conn1 -> conn2 -> conn3
         /
      Node

Each node contains a single-linked list of connections that share the
same source (address/network).

When searching for the Node, all Nodes traversed get garbage-collected,
i.e. connlimit walks the hlist attached to the node and removes any tuple
no longer stored in the main conntrack table.  If the node then has
empty list, it gets erased from the tree.

But if we limit by zone then its entirely reasonable to have a limit
of e.g.  10k per zone, i.e. each Node could have a list consisting
of 10k elements.  Walking a 10k list is not acceptable from packet path.

Instead, I propose to store a timestamp of last gc in each node, so
we can skip intermediate nodes that had a very recent gc run.

If we find the node (zone, source network, etc. etc) that we should
count the new connection for, then do an on-demand garbage collection.

This is unfortunately required, we don't want to say '150' if the limit
is 150 then the new connection would be dropped, unless we really still
have 150 active connections.

To resolve this, i suggest to store the count in the node itself
(so we do not need to walk the full list of individual connections in the
 packet path).

The hlist would be replaced with another tree:

This allows us to quickly find out if the tuple we want to count now
is already stored (e.g. because of address/port reuse) or not.

It also permits us to check all tuples we see while searching for
the 'new' tuple that should be stored in the subtree and remove those that
are no longer in the main conntrack table.
We could also abort a packet-path gc run if we found at least one old
connection.

A work queue will take care of periodically scanning the main tree
and all sub-trees.  This workqueue would not disable bh for long times
as to not impact normal network processing.

I also considered an in-kernel notifier api for DESTROY events instead
of a gc scheme but it seems it would be more complicated (and
requires changes in the conntrack core).  We would also need to generate
such a destroy even for all conntracks, and not just those that were
committed to the main table.

nftables flow (dynset) or set infrastructure doesn't appear to be of any
use since they address different problem/use cases
(unless I missed something obvious).

Thoughts?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] connlimit for nftables and limiting ct count by zone
  2017-10-16 14:42 [RFC] connlimit for nftables and limiting ct count by zone Florian Westphal
@ 2017-10-27  0:49 ` Yi-Hung Wei
  2017-10-30 10:40   ` Florian Westphal
  0 siblings, 1 reply; 5+ messages in thread
From: Yi-Hung Wei @ 2017-10-27  0:49 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, YiHung Wei, Justin Pettit, Joe Stringer

On Mon, Oct 16, 2017 at 7:42 AM, Florian Westphal <fw@strlen.de> wrote:
> During NFWS we briefly discussed iptables '-m connlimit' and how
> to apply this to nftables.
>
> There is also a use case to make nf_conntrack_max more fine-grained
> by making this setting apply per conntrack zone.
>
> I'd like to ideally resolve both with a single solution.
Thanks for the proposal. It would be great to be able to resolve the
two issues at the same time.


> From nft (user front end) this would look like this:
>
> add rule inet filter input ct state new ct count { ip saddr } gt 100 drop
>
> The part enclosed in { } denotes the key/grouping that should be applied.
>
> e.g.
>
> ct count { ip saddr & 255.255.255.0 }   # count by source network
> ct count { ip saddr & 255.255.255.0 . tcp dport }   # count by source network and service
> ct count { ct zone } # count by zone
Does it mean that all the ct zones will have the same limit? If it is,
shall we extend this syntax to support different limit for different
zone? How would we use nft to get the ct limit and ct count on
particular key ? It may be useful for debugging and monitoring
purpose?

Other than using nft to configure ct limit, if other user space
utilities want to set/get the ct limit or query the current ct count
would there be any API to achieve that?


> For this to work there are several issues that need to be resolved.
>
> 1. xt_connlimit must be split into an iptables part and a 'nf_connlimit'
>    backend.
>
>    nf_connlimit.c would implement the main function:
>
> unsigned int nf_connlimit_count(struct nf_connlimit_data *,
>                                 const struct nf_conn *conn,
>                                 const void *key,
>                                 u16 keylen);
>
> Where 'nf_connlimit_data' is a structure that contains the (internal)
> bookkeeping structure(s), conn is the connection that is to be counted,
> and key/keylen is the (arbitrary) identifier to be used for grouping
> connections.
>
> xt_connlimits match function would then build a 'u32 key[5]' based on
> the options the user provided on the iptables command line, i.e.
> the conntrack zone and then either a source or destination network
> (or address).
It is not clear to me that why we have 'u32 key[5]'. Is it because the
key would be the combination of Ip/ipv6 src or dst address with mask
(up to 16bytes), src or dst port (2 bytes), and ct zone (2 bytes)?

Do we need to specify the order for the key? How do we distinguish the
case where we have zone (u16) as a key vs. dport (u16) as a key?


> 3. Other users, such as ovs, could also call this api, in the case of
> per-zone limiting key would simply be a u16 containing the zone identifier.
I am wondering if there will be APIs to set and get the conntrack
limit on particular key, and also query the ct count on particular key
in the kernel space?


> Right now, connlimit performs garbage collection from the packet path.
> This isn't a big deal now, as we limit based by single ip address or network
> only, the limit will be small.
>
> bookkeeping looks like this:
>
>       Node
>       / \
>    Node  Node  -> conn1 -> conn2 -> conn3
>          /
>       Node
>
> Each node contains a single-linked list of connections that share the
> same source (address/network).
>
> When searching for the Node, all Nodes traversed get garbage-collected,
> i.e. connlimit walks the hlist attached to the node and removes any tuple
> no longer stored in the main conntrack table.  If the node then has
> empty list, it gets erased from the tree.
Thanks for the explanation. It saves me quite some time to understand
the code structure there.


> But if we limit by zone then its entirely reasonable to have a limit
> of e.g.  10k per zone, i.e. each Node could have a list consisting
> of 10k elements.  Walking a 10k list is not acceptable from packet path.
>
> Instead, I propose to store a timestamp of last gc in each node, so
> we can skip intermediate nodes that had a very recent gc run.
>
> If we find the node (zone, source network, etc. etc) that we should
> count the new connection for, then do an on-demand garbage collection.
>
> This is unfortunately required, we don't want to say '150' if the limit
> is 150 then the new connection would be dropped, unless we really still
> have 150 active connections.
>
> To resolve this, i suggest to store the count in the node itself
> (so we do not need to walk the full list of individual connections in the
>  packet path).
>
> The hlist would be replaced with another tree:
>
> This allows us to quickly find out if the tuple we want to count now
> is already stored (e.g. because of address/port reuse) or not.
>
> It also permits us to check all tuples we see while searching for
> the 'new' tuple that should be stored in the subtree and remove those that
> are no longer in the main conntrack table.
> We could also abort a packet-path gc run if we found at least one old
> connection.
I agree that it is inefficient to do GC if we host say 10k connections
in the hlist within a tree node. I think all the aforementioned
optimization mechanisms such as storing 'timestamp' and 'count' in the
tree node, aborting if we find at least one old connections, and
replacing hlist with a tree, are useful. Basically, the first two
techniques will reduce unnecessarily GC on the packet path, and
replacing hlist with a rb_tree will further reduce the insertion time
to the tree node from O(n) to O(log n), i.e. n is # of connections in
the hlist.

Just as you mentioned in the DESTORY event approach. I am wondering if
we can store an extra pointer to the tree node maybe in struct
nf_ct_ext, so that we can associate nf_conn with the tree node and
increase the count when the connection is confirmed, and we will
reduce the count when the connection is destroyed. In this way, with
some extra space, the insertion and updating the count in the tree
node is O(1) and we can get rid of GC scheme. I am not sure how much
effort does it take to modify the conntrack core tho.

Thanks,

-Yi-Hung

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] connlimit for nftables and limiting ct count by zone
  2017-10-27  0:49 ` Yi-Hung Wei
@ 2017-10-30 10:40   ` Florian Westphal
  2017-10-31 23:50     ` Yi-Hung Wei
  0 siblings, 1 reply; 5+ messages in thread
From: Florian Westphal @ 2017-10-30 10:40 UTC (permalink / raw)
  To: Yi-Hung Wei
  Cc: Florian Westphal, netfilter-devel, YiHung Wei, Justin Pettit,
	Joe Stringer

Yi-Hung Wei <yihung.wei@gmail.com> wrote:
> On Mon, Oct 16, 2017 at 7:42 AM, Florian Westphal <fw@strlen.de> wrote:
> > From nft (user front end) this would look like this:
> >
> > add rule inet filter input ct state new ct count { ip saddr } gt 100 drop
> >
> > The part enclosed in { } denotes the key/grouping that should be applied.
> >
> > e.g.
> >
> > ct count { ip saddr & 255.255.255.0 }   # count by source network
> > ct count { ip saddr & 255.255.255.0 . tcp dport }   # count by source network and service
> > ct count { ct zone } # count by zone
> Does it mean that all the ct zones will have the same limit?

Yes, for now.  I in nft the right hand side of compare operations in nft
needs to be an immediate.

> shall we extend this syntax to support different limit for different
> zone?

No, I don't think we should do this at the moment.

I think we could possible use maps to fetch values based on a key, e.g.

nft add map filter zonelimits { type typeof(ct zone) : typeof(ct count); flags interval\;}
nft add element filter zonelimits { 1-100 : 50000 }
nft add element filter zonelimits { 101 : 60000 }

nft add rule filter prerouting \
  ct count { ip saddr & 255.255.255.0 } gt { ct zone }

None of this is implemented or planned, just an idea.

Don't think we should bother about this at this time and
focus on the conntrack counting infrastructure first.

> How would we use nft to get the ct limit and ct count on
> particular key ? It may be useful for debugging and monitoring
> purpose?

Really?  So far my impression was that ovs does not wish
to interact with nft in any way.

> Other than using nft to configure ct limit, if other user space
> utilities want to set/get the ct limit or query the current ct count
> would there be any API to achieve that?

I think that ovs would hook into the api that nft uses.
Its up to the user to make sure that whatever limits are configured in
nft don't mess with ovs.

The other alternative of course is to add a new family and stick
NF_HOOK() calls into ovs...

> > 1. xt_connlimit must be split into an iptables part and a 'nf_connlimit'
> >    backend.
> >
> >    nf_connlimit.c would implement the main function:
> >
> > unsigned int nf_connlimit_count(struct nf_connlimit_data *,
> >                                 const struct nf_conn *conn,
> >                                 const void *key,
> >                                 u16 keylen);
> >
> > Where 'nf_connlimit_data' is a structure that contains the (internal)
> > bookkeeping structure(s), conn is the connection that is to be counted,
> > and key/keylen is the (arbitrary) identifier to be used for grouping
> > connections.
> >
> > xt_connlimits match function would then build a 'u32 key[5]' based on
> > the options the user provided on the iptables command line, i.e.
> > the conntrack zone and then either a source or destination network
> > (or address).
> It is not clear to me that why we have 'u32 key[5]'. Is it because the
> key would be the combination of Ip/ipv6 src or dst address with mask
> (up to 16bytes), src or dst port (2 bytes), and ct zone (2 bytes)?

Yes, xt_connlimit would have

u32 key[5];

key[0] = zone;
key[1] = iphdr->saddr;

nf_connlimit_count, ... key, 8);

key[2], [3], [4] are only needed for ipv6.

> Do we need to specify the order for the key? How do we distinguish the
> case where we have zone (u16) as a key vs. dport (u16) as a key?

The key is opaque to the backend, it just needs to this for memcmp.

> > 3. Other users, such as ovs, could also call this api, in the case of
> > per-zone limiting key would simply be a u16 containing the zone identifier.
> I am wondering if there will be APIs to set and get the conntrack
> limit on particular key,

No, this would not be provided by that backend, as the backend doesn't
know how the result is used.  It just returns the number of connections
that match the current key.

> and also query the ct count on particular key
> in the kernel space?

For what purpose would that be needed?

> Just as you mentioned in the DESTORY event approach. I am wondering if
> we can store an extra pointer to the tree node maybe in struct
> nf_ct_ext, so that we can associate nf_conn with the tree node and
> increase the count when the connection is confirmed, and we will
> reduce the count when the connection is destroyed. In this way, with
> some extra space, the insertion and updating the count in the tree
> node is O(1) and we can get rid of GC scheme. I am not sure how much
> effort does it take to modify the conntrack core tho.

There could be dozens of connlimit trees.

For example, a user could configure per-zone limit and a per-host limit
at the same time.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] connlimit for nftables and limiting ct count by zone
  2017-10-30 10:40   ` Florian Westphal
@ 2017-10-31 23:50     ` Yi-Hung Wei
  2017-11-01  1:26       ` Florian Westphal
  0 siblings, 1 reply; 5+ messages in thread
From: Yi-Hung Wei @ 2017-10-31 23:50 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, YiHung Wei, Justin Pettit, Joe Stringer

> I think we could possible use maps to fetch values based on a key, e.g.
>
> nft add map filter zonelimits { type typeof(ct zone) : typeof(ct count); flags interval\;}
> nft add element filter zonelimits { 1-100 : 50000 }
> nft add element filter zonelimits { 101 : 60000 }
>
> nft add rule filter prerouting \
>   ct count { ip saddr & 255.255.255.0 } gt { ct zone }
>
> None of this is implemented or planned, just an idea.
>
> Don't think we should bother about this at this time and
> focus on the conntrack counting infrastructure first.
You are right, the first step should be focusing on the conntrack
counting infrastructure, and supporting that using maps looks handy.


>> How would we use nft to get the ct limit and ct count on
>> particular key ? It may be useful for debugging and monitoring
>> purpose?
>
> Really?  So far my impression was that ovs does not wish
> to interact with nft in any way.
I was thinking about how to use nft to get and the ct limit and ct
count not by ovs tho. Say a nft user may want to know how many active
connections are there in a particular zone.


>> It is not clear to me that why we have 'u32 key[5]'. Is it because the
>> key would be the combination of Ip/ipv6 src or dst address with mask
>> (up to 16bytes), src or dst port (2 bytes), and ct zone (2 bytes)?
>
> Yes, xt_connlimit would have
>
> u32 key[5];
>
> key[0] = zone;
> key[1] = iphdr->saddr;
>
> nf_connlimit_count, ... key, 8);
>
> key[2], [3], [4] are only needed for ipv6.
Thanks for explanation, that this part is clear now.


>> I am wondering if there will be APIs to set and get the conntrack
>> limit on particular key,
>
> No, this would not be provided by that backend, as the backend doesn't
> know how the result is used.  It just returns the number of connections
> that match the current key.
>
>> and also query the ct count on particular key
>> in the kernel space?
>
> For what purpose would that be needed?
I think as long as the user space utilities e.g. nft can query ct
limit, ct count, it is not necessarily to support that in the kernel
space.


>> Just as you mentioned in the DESTORY event approach. I am wondering if
>> we can store an extra pointer to the tree node maybe in struct
>> nf_ct_ext, so that we can associate nf_conn with the tree node and
>> increase the count when the connection is confirmed, and we will
>> reduce the count when the connection is destroyed. In this way, with
>> some extra space, the insertion and updating the count in the tree
>> node is O(1) and we can get rid of GC scheme. I am not sure how much
>> effort does it take to modify the conntrack core tho.
>
> There could be dozens of connlimit trees.
>
> For example, a user could configure per-zone limit and a per-host limit
> at the same time.
You are right, in this case we can store a list of pointers to
multiple trees. I do not have a strong opinion on this tho, just want
to check if it can help the design.

Thanks,

-Yi-Hung

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] connlimit for nftables and limiting ct count by zone
  2017-10-31 23:50     ` Yi-Hung Wei
@ 2017-11-01  1:26       ` Florian Westphal
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Westphal @ 2017-11-01  1:26 UTC (permalink / raw)
  To: Yi-Hung Wei
  Cc: Florian Westphal, netfilter-devel, YiHung Wei, Justin Pettit,
	Joe Stringer

Yi-Hung Wei <yihung.wei@gmail.com> wrote:
> > Really?  So far my impression was that ovs does not wish
> > to interact with nft in any way.
> I was thinking about how to use nft to get and the ct limit and ct
> count not by ovs tho. Say a nft user may want to know how many active
> connections are there in a particular zone.

They can do that with the conntrack tool ("conntrack -L --zone 42"),
we can extend the filtering capabilties if needed.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-11-01  1:26 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-16 14:42 [RFC] connlimit for nftables and limiting ct count by zone Florian Westphal
2017-10-27  0:49 ` Yi-Hung Wei
2017-10-30 10:40   ` Florian Westphal
2017-10-31 23:50     ` Yi-Hung Wei
2017-11-01  1:26       ` Florian Westphal

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.