* [RFC] High Performance Packet Classification for tc framework
@ 2003-07-14  8:45 Michael Bellion and Thomas Heinz
  2003-07-16  5:02 ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-07-14  8:45 UTC (permalink / raw)
  To: linux-kernel, linux-net, netdev

Hi

We are planning to port our HIPAC algorithm to the tc framework and
would like your comments on several issues.

HIPAC is a novel packet classification framework that replaces the
common linear approach with a more advanced data structure, offering
highly efficient and locking-friendly packet matching.

Currently, people using lots of filters suffer from a big performance
penalty. In contrast, HIPAC is able to handle thousands of filters
without much slowdown compared to having no filters at all.

There already exists one application of the HIPAC algorithm for the
netfilter framework: nf-hipac. Details about the project can be found
at our website http://www.hipac.org or our sourceforge project page at
http://www.sourceforge.net/projects/nf-hipac

Several performance tests of nf-hipac have been done so far (see our
website) and have confirmed our claims. So it would be great if tc users
could benefit from HIPAC too.

Certainly, we'd like to know first whether HIPAC makes sense for the
tc framework at all. From the nf-hipac worst case performance tests
we know that our algorithm should be faster in all cases as soon as
you have approx. 20 filters. Below 20 filters there is no difference
between nf-hipac and the iptables filter table.
So basically the question is: Are people using the tc framework with
lots of filters? Some numbers would be helpful.

Since we can only improve the performance of u32 and fw filters, it
would also be interesting to know whether such rulesets typically
consist mainly of those filter types.

The tc framework is very flexible with respect to where filters can be
attached. Unfortunately this cannot be mapped into one HIPAC data
structure. Our current design allows filters to be attached anywhere,
but only the filters attached to the top level qdisc would benefit from
the HIPAC algorithm. Would this be a noticeable restriction?


Here is a short overview of the main design goals (a hypothetical tc
session follows the list):

- new qdisc for HIPAC which is basically a container for the filters;
  it can only be attached as top level qdisc
- new HIPAC classifier which supports all native nf-hipac matches
  (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type,
  tcpflags, fragments) and additionally fwmark
- the HIPAC classifier can only be attached to the HIPAC qdisc and vice
  versa the HIPAC qdisc only accepts HIPAC classifiers
- the HIPAC qdisc consists of only one single class to which the "next"
  qdisc must be attached
- the HIPAC classifier can contain a number of existing classifiers
  (u32, fw, route, rsvp, tcindex), where the semantics are as follows:
  a HIPAC classifier matches if the native matches and also each of the
  embedded classifiers match; the returned tcf_result is the one from
  the final classifier (=> intermediate classifiers are reduced to a
  match)
- it is still possible to attach non-hipac classifiers to other qdiscs
  and classes
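
To make the intended use concrete, here is a hypothetical tc session
for this design. The qdisc/classifier names and option syntax below are
purely illustrative (no such tc support exists); it only shows how the
pieces would fit together:

----
# attach the HIPAC container qdisc at the top level (hypothetical syntax)
tc qdisc add dev eth0 root handle 1: hipac
# attach the "next" qdisc, e.g. a real scheduler, to its single class
tc qdisc add dev eth0 parent 1:1 handle 2: htb
# add a HIPAC filter with native matches; only HIPAC filters are accepted
tc filter add dev eth0 parent 1: protocol ip prio 1 hipac \
    src 10.0.0.0/24 proto tcp dport 80 classid 2:1
----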


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+



* Re: [RFC] High Performance Packet Classification for tc framework
  2003-07-14  8:45 [RFC] High Performance Packet Classification for tc framework Michael Bellion and Thomas Heinz
@ 2003-07-16  5:02 ` jamal
  2003-07-17 13:13   ` Michael Bellion and Thomas Heinz
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-07-16  5:02 UTC (permalink / raw)
  To: nf; +Cc: linux-net, netdev

Hi Michael,

I noticed, from my browsing just now, that you guys sometimes post to
the kernel list without even ccing netdev. Please send msgs to netdev
first; ccing lk is optional. In fact, i took it out of the cc.

On Mon, 2003-07-14 at 04:45, Michael Bellion and Thomas Heinz wrote:
> Hi
> 
> We are planning to port our HIPAC algorithm to the tc framework and
> would like your comments on several issues.
> 

This is good. I may have emailed you about this topic before?

[..]

> 
> Certainly, we'd like to know first whether HIPAC makes sense for the
> tc framework at all. 

It's a classifier therefore it makes sense ;->

> From the nf-hipac worst case performance tests
> we know that our algorithm should be faster in all cases as soon as
> you have approx. 20 filters. Below 20 filters there is no difference
> between nf-hipac and the iptables filter table.

nice. What would be interesting is to see your rule update rates vs
iptables (i expect iptables to suck) - but how do you compare against
any of the tc classifiers, for example?
  
> So basically the question is: Are people using the tc framework with
> lots of filters? Some numbers would be helpful.
> 

I am not sure anybody could give you numbers; i have seen a posting once
which talked about 1K rules. I hardly use more than 10 in my setup;
however, there may be people who will be looking (even if it is for
benchmarking purposes) in the 100K range. Short answer - go for the max
you can.

> Since we can only improve the performance of u32 and fw filters, it
> would also be interesting to know whether such rulesets typically
> consist mainly of those filter types.
> 

Actually there's a _lot_ of room for improvement for u32. I have played
with a few tricks which will greatly up the numbers for high rule
counts.

> The tc framework is very flexible with respect to where filters can be
> attached. Unfortunately this cannot be mapped into one HIPAC data
> structure. Our current design allows filters to be attached anywhere,
> but only the filters attached to the top level qdisc would benefit from
> the HIPAC algorithm. Would this be a noticeable restriction?
> 

I dont think so, but can you describe this restriction?

> 
> Here is a short overview of the main design goals (a hypothetical tc
> session follows the list):
> 
> - new qdisc for HIPAC which is basically a container for the filters;
>   it can only be attached as top level qdisc

why?

> - new HIPAC classifier which supports all native nf-hipac matches
>   (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type,
>   tcpflags, fragments) and additionally fwmark

I would think for cleanliness fwmark or any metadata related
classification would be separate from one that is based on packet bits.

> - the HIPAC classifier can only be attached to the HIPAC qdisc and vice
>   versa the HIPAC qdisc only accepts HIPAC classifiers

<puke> 
We do have an issue with being able to do extended classification
but building a qdisc for it is a no-no. Building a qdisc that will force
other classifiers to structure themselves after it is an even bigger
sin. Look at the action code i have (i can send you an updated patch); a
better idea is to make extended classifiers an action based on another
filter match. At least this is what i have been toying with and i dont
think it is clean enough. What we need is to extend the filtering
framework itself to have extended classifiers.

> - the HIPAC qdisc consists of only one single class to which the "next"
>   qdisc must be attached
> - the HIPAC classifier can contain a number of existing classifiers
>   (u32, fw, route, rsvp, tcindex), where the semantics are as follows:
>   a HIPAC classifier matches if the native matches and also each of the
>   embedded classifiers match; the returned tcf_result is the one from
>   the final classifier (=> intermediate classifiers are reduced to a
>   match)
> - it is still possible to attach non-hipac classifiers to other qdiscs
>   and classes
> 

Please junk this idea. Study the classification framework code and come
up with something that is based on that. The second best option is to
take a look at the action code. We dont want to force other classifiers
in research to be constrained to your structures. This is not meant to
put down the work you are doing - i think you have something good.

cheers,
jamal



* Re: [RFC] High Performance Packet Classification for tc framework
  2003-07-16  5:02 ` jamal
@ 2003-07-17 13:13   ` Michael Bellion and Thomas Heinz
  2003-08-03 18:14     ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-07-17 13:13 UTC (permalink / raw)
  To: hadi; +Cc: linux-net, netdev


Hi Jamal

You wrote:
> This is good. I may have emailed you about this topic before?

Yes, but at that time we did not have any concrete plans to
integrate hipac into tc. We focused on making nf-hipac as
expressive as iptables first.

> It's a classifier therefore it makes sense ;->

:-)

> nice. What would be interesting is to see your rule update rates vs
> iptables (i expect iptables to suck) - but how do you compare against
> any of the tc classifiers, for example?

Regarding the rule update rates, we have not done any measurements
yet, but nf-hipac should be faster than iptables (even more so once
we have implemented the selective cloning stuff). On the other
hand, we are probably slower than tc because, in addition to the
insert operation into an internal chain, there is the actual hipac
insert operation. The insertion into the internal chain is quicker
than the tc insert operation because we use doubly linked lists.

Regarding the matching performance one has to consider a few things.
The currently existing tc classifiers are an abstraction for rules
(iptables "slang") whilst hipac is an abstraction for a set of rules
(including the chain semantics known from iptables), i.e. a table in
the iptables world. Of course it is possible to have some sort
of extended classifying in tc too, i.e. you can add several fw or u32
filters with the same prio which allows the filters to be hashed.
One disadvantage of this concept is that the hashed filters
must be compact, i.e. there cannot be other classifiers in between.
Another major disadvantage is caused by the hashing scheme.
If you use the hash for 1 dimension you have to make sure that
either all filters in a certain bucket are disjoint or you must have
an implicit ordering of the rules (according to the insertion order
or something). This scheme is not extendable to 2 or more dimensions,
i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
on, because you simply cannot express arbitrary rulesets.
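
For reference, a one-dimensional u32 hash is set up roughly like this
(standard u32 hashing syntax as we understand it; the offsets, divisor
and handles are illustrative and untested):

----
# create hash table 2: with 256 buckets at this prio
tc filter add dev eth0 parent 1:0 prio 5 handle 2: protocol ip u32 divisor 256
# hash every packet on the last byte of the dst ip (offset 16 in the IP header)
tc filter add dev eth0 parent 1:0 prio 5 protocol ip u32 ht 800:: \
    match u32 0 0 hashkey mask 0x000000ff at 16 link 2:
# put a filter into bucket 0x06; it only sees packets hashed there
tc filter add dev eth0 parent 1:0 prio 5 protocol ip u32 ht 2:6: \
    match ip dst 10.0.0.6 flowid 1:1
----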

Another general problem is of course that the user has to manually
set up the hash, which is rather inconvenient.

Now, what are the implications on the matching performance:
tc vs. nf-hipac? As long as the extended hashing stuff is not used
nf-hipac is clearly superior to tc. When hashing is used it _really_
depends. If there is only one classifier (with hashing) per interface
and the number of rules per bucket is very small the performance should
be comparable. As soon as you add other classifiers nf-hipac will
outperform tc again.

>>The tc framework is very flexible with respect to where filters can be
>>attached. Unfortunately this cannot be mapped into one HIPAC data
>>structure. Our current design allows filters to be attached anywhere,
>>but only the filters attached to the top level qdisc would benefit from
>>the HIPAC algorithm. Would this be a noticeable restriction?
> 
> I dont think so, but can ytou describe this restriction?

Well, we thought a little more about the design and came to the
conclusion that it is not necessary to have a HIPAC qdisc at root
but it suffices to ensure that the HIPAC classifier occurs only
once per interface. As you can guess from the last sentence we
dropped the HIPAC qdisc design and changed to the following scheme:

- there is no special HIPAC qdisc at all :-)
- the HIPAC classifier is no longer a simple rule but represents
   the whole table
- the HIPAC classifier can occur in any qdisc but at most once
   per interface

So, basically HIPAC is just a normal classifier like any other
with two exceptions:
   a) it can occur only once per interface
   b) the rules within the classifier can contain other classifiers,
      e.g. u32, fw, tc_index, as matches

There is just one problem with the current tc framework. Once
a new filter is inserted into the chain it is not removed even
if the change function of the classifier returns < 0
(2.6.0-test1: net/sched/cls_api.c: line 280f).
This should be changed anyway, shouldn't it?

>>- new HIPAC classifier which supports all native nf-hipac matches
>>  (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type,
>>  tcpflags, fragments) and additionally fwmark
> 
> I would think for cleanliness fwmark or any metadata related
> classification would be separate from one that is based on packet bits.

Since our classifier represents a table of rules and the rules are
based on different matches, like src/dst ip and also fwmark (native)
or u32 (subclassifier as match), this is definitely a clean design.

>>- the HIPAC classifier can only be attached to the HIPAC qdisc and vice
>>  versa the HIPAC qdisc only accepts HIPAC classifiers
> 
> <puke> 
> We do have an issue with being able to do extended classification
> but building a qdisc for it is a no-no. Building a qdisc that will force
> other classifiers to structure themselves after it is an even bigger
> sin. Look at the action code i have (i can send you an updated patch); a
> better idea is to make extended classifiers an action based on another
> filter match. At least this is what i have been toying with and i dont
> think it is clean enough. What we need is to extend the filtering
> framework itself to have extended classifiers.

The new design should be much cleaner. Originally we also thought about
making HIPAC a classifier only, but we expected some problems with this
approach. In the end it turned out that these problems do not arise :)


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+



* Re: [RFC] High Performance Packet Classification for tc framework
  2003-07-17 13:13   ` Michael Bellion and Thomas Heinz
@ 2003-08-03 18:14     ` jamal
  2003-08-04 13:17       ` Michael Bellion and Thomas Heinz
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-08-03 18:14 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: linux-net, netdev

Hi,

Apologies for late response. Its funny how i thought i was going to have
more time in the last 2 weeks but due to bad scheduling that wasnt the
case.

On Thu, 2003-07-17 at 09:13, Michael Bellion and Thomas Heinz wrote:
> Hi Jamal
> 
> You wrote:
> > This is good.I may have emailed you about this topic before?
> 
> Yes, but at that time we did not have any concrete plans to
> integrate hipac into tc. We focused on making nf-hipac as
> expressive as iptables first.
> 

Good goal.

> > It's a classifier therefore it makes sense ;->
> 
> :-)
> 
> > nice. What would be interesting is to see your rule update rates vs
> > iptables (i expect iptables to suck) - but how do you compare against
> > any of the tc classifiers, for example?
> 
> Regarding the rule update rates, we have not done any measurements
> yet, but nf-hipac should be faster than iptables (even more so once
> we have implemented the selective cloning stuff). On the other
> hand, we are probably slower than tc because, in addition to the
> insert operation into an internal chain, there is the actual hipac
> insert operation. The insertion into the internal chain is quicker
> than the tc insert operation because we use doubly linked lists.
> 

I think i will have to look at your code to make comments.

> Regarding the matching performance one has to consider a few things.
> The currently existing tc classifiers are an abstraction for rules
> (iptables "slang") whilst hipac is an abstraction for a set of rules
> (including the chain semantics known from iptables), i.e. a table in
> the iptables world. 

Not entirely accurate. Depends which tc classifier. u32 hash tables are
in fact like iptables chains.
Note, the concept of priorities which is used for conflict resolution as
well as further separating sets of rules doesnt exist in iptables.

> Of course it is possible to have some sort
> of extended classifying in tc too, 

True, i overlooked this. 

> i.e. you can add several fw or u32
> filters with the same prio which allows the filters to be hashed.

You can also have them use different priorities and, with the continue
operator, first classify based on packet data, then on metadata or on
another packet header filter.

> One disadvantage of this concept is that the hashed filters
> must be compact, i.e. there cannot be other classifiers in between.

I didnt understand this. Are you talking about conflict resolution of
overlapping filters?

> Another major disadvantage is caused by the hashing scheme.
> If you use the hash for 1 dimension you have to make sure that
> either all filters in a certain bucket are disjoint or you must have
> an implicit ordering of the rules (according to the insertion order
> or something). This scheme is not extendable to 2 or more dimensions,
> i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
> on, because you simply cannot express arbitrary rulesets.

If i understood you - you are referring to a way to reduce the number of
lookups by having disjoint hashes. I suppose for something as simple as
a five tuple lookup, this is almost solvable by hardcoding the different
fields into multiway hashes. Its when you try to generalize that it
becomes an issue.

> Another general problem is of course that the user has to manually
> set up the hash, which is rather inconvenient.
> 

Yes. Take a look at Werner's tcng - he has a clever way to hide things
from the user. I did experimentation on u32 with a kernel thread which
rearranged things when they seemed out of balance, but i havent
experimented with a lot of rules.

> Now, what are the implications on the matching performance:
> tc vs. nf-hipac? As long as the extended hashing stuff is not used
> nf-hipac is clearly superior to tc. 

You are referring to u32. You mean as long as u32 stores things in a
singly linked list, you win - correct?

> When hashing is used it _really_
> depends. If there is only one classifier (with hashing) per interface
> and the number of rules per bucket is very small the performance should
> be comparable. As soon as you add other classifiers nf-hipac will
> outperform tc again.
> 

If we take a simple user interface abstraction like tcng which hides the
evil of u32 and then take simple 5 tuple rules - i doubt you will see
any difference. For a more generic setup, the kernel thread i refer to
would work - but it may slow insertion.

> >>The tc framework is very flexible with respect to where filters can be
> >>attached. Unfortunately this cannot be mapped into one HIPAC data
> >>structure. Our current design allows filters to be attached anywhere,
> >>but only the filters attached to the top level qdisc would benefit from
> >>the HIPAC algorithm. Would this be a noticeable restriction?
> > 
> > I dont think so, but can ytou describe this restriction?
> 
> Well, we thought a little more about the design and came to the
> conclusion that it is not necessary to have a HIPAC qdisc at root
> but it suffices to ensure that the HIPAC classifier occurs only
> once per interface. As you can guess from the last sentence we
> dropped the HIPAC qdisc design and changed to the following scheme:
> 
> - there is no special HIPAC qdisc at all :-)
> - the HIPAC classifier is no longer a simple rule but represents
>    the whole table
> - the HIPAC classifier can occur in any qdisc but at most once
>    per interface
>
> So, basically HIPAC is just a normal classifier like any other
> with two exceptions:
>    a) it can occur only once per interface
>    b) the rules within the classifier can contain other classifiers,
>       e.g. u32, fw, tc_index, as matches
> 

But why restriction a)? 
Also why should we need hipac to hold other filters when the
infrastructure itself can hold the extended filters just fine?
I think you may actually be trying to say why somewhere in the email,
but it must not be making a significant impression on my brain.

> There is just one problem with the current tc framework. Once
> a new filter is inserted into the chain it is not removed even
> if the change function of the classifier returns < 0
> (2.6.0-test1: net/sched/cls_api.c: line 280f).
> This should be changed anyway, shouldn't it?
> 

Are you referring to this piece of code?
----
        err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
        if (err == 0)
                tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);

errout:
        if (cl)
                cops->put(q, cl);
        return err;
---

change() should not return < 0 if it has installed the filter, i think.
Should the top level code be responsible for removing filters?

> >>- new HIPAC classifier which supports all native nf-hipac matches
> >>  (src/dst ip, proto, src/dst port, ttl, state, in_iface, icmp type,
> >>  tcpflags, fragments) and additionally fwmark
> > 
> > I would think for cleanliness fwmark or any metadata related
> > classification would be separate from one that is based on packet bits.
> 
> Since our classifier represents a table of rules and the rules are
> based on different matches, like src/dst ip and also fwmark (native)
> or u32 (subclassifier as match), this is definitely a clean design.
> 

I think we need to have the infrastructure in the main tc code. Its
already there - may not be very clean right now. 

> >>- the HIPAC classifier can only be attached to the HIPAC qdisc and vice
> >>  versa the HIPAC qdisc only accepts HIPAC classifiers
> > 
> > <puke> 
> > We do have an issue with being able to do extended classification
> > but building a qdisc for it is a no no. Building a qdisc that will force
> > other classifier to structure themselves after it is even a bigger sin.
> > Look at the action code i have (i can send you an updated patch); a
> > better idea is to make extended classifiers an action based on another
> > filter match. At least this is what i have been toying with and i dont
> > think it is clean enough. what we need is to extend the filtering
> > framework itself to have extended classifiers.
> 
> The new design should be much cleaner. Originally we also thought about
> making HIPAC a classifier only, but we expected some problems with this
> approach. In the end it turned out that these problems do not arise :)
> 

Consider what i said above. I'll try n cobble together some examples to
demonstrate (although it seems you already know this).
To allow for anyone to install classifiers-du-jour without being
dependent on hipac would be very useful. So ideas that you have for
enabling this cleanly should be moved to cls_api.

cheers,
jamal


* Re: [RFC] High Performance Packet Classification for tc framework
  2003-08-03 18:14     ` jamal
@ 2003-08-04 13:17       ` Michael Bellion and Thomas Heinz
  2003-08-04 15:51         ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-04 13:17 UTC (permalink / raw)
  To: hadi; +Cc: linux-net, netdev


Hi Jamal

You wrote:
> Apologies for late response. Its funny how i thought i was going to have
> more time in the last 2 weeks but due to bad scheduling that wasnt the
> case.

No problemo ...

> I think i will have to look at your code to make comments.

We're curious to hear them.

> Not entirely accurate. Depends which tc classifier. u32 hash tables are
> in fact like iptables chains.

Hm, we don't think so. Unfortunately, there does not seem to be much
information about the inner workings of u32 and we currently don't have
the time to deduce the whole algorithm from the code.

Here is a short overview of our current view on u32:
      - each u32 filter "rule" consists of possibly several u32 matches,
        i.e. tc_u32_sel with nkeys tc_u32_key's
        => one rule is basically represented as a struct tc_u_knode
      - a set of u32 filter rules with same priority is in general a
        tree of hashes like for example:
                        hash1: |--|--|
                                /   \
                hash2: |--|--|--|    hash3: |--|--|--|--|
                         |  |  |              |  |  |  |
                        r1 r2 r3             r4 r5 r6 r7
        where the r_i are in fact lists of rules (-> hashing with
        chaining)
        => if there is more than one single filter with same prio
           there is always a tree of hashes (with possibly only 1 node
           (=hash))
      - within such a tree of u32 filters (with same prio) there is
        no concept of prioritizing them any further => the rules must
        be conflict free
      - there is no way of optimizing filters with different priorities
        since one cannot assume that the intermediate classifiers are all
        of the same type

If this is not the way it _really_ works we'd appreciate it if you could
describe the general principles behind u32.
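
For reference, these are the structures we are reasoning about
(abridged from 2.6.0-test1 include/linux/pkt_cls.h and
net/sched/cls_u32.c; fields not relevant here are omitted):

----
/* one match key: (packet dword at offset off) & mask == val */
struct tc_u32_key {
        __u32   mask;
        __u32   val;
        int     off;            /* offset into the packet */
        int     offmask;
};

/* selector: a filter rule is a conjunction of nkeys such keys */
struct tc_u32_sel {
        unsigned char   flags;
        unsigned char   offshift;
        unsigned char   nkeys;
        /* ... offset computation fields omitted ... */
        struct tc_u32_key keys[0];
};

/* kernel-internal: one filter rule (keynode) */
struct tc_u_knode {
        struct tc_u_knode  *next;     /* chaining within a bucket */
        u32                 handle;
        struct tc_u_hnode  *ht_up;    /* hash table we live in */
        struct tcf_result   res;      /* classid returned on match */
        struct tc_u_hnode  *ht_down;  /* optional link to another hash */
        struct tc_u32_sel   sel;
};
----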

> Note, the concept of priorities which is used for conflict resolution as
> well as further separating sets of rules doesnt exist in iptables.

Well, iptables rule position and tc filter priorities are just the
same apart from the fact that iptables does not allow multiple rules
to have the same priority (read: position). Therefore iptables rulesets
don't suffer from conflicts.

> You can also have them use different priorities and, with the continue
> operator, first classify based on packet data, then on metadata or on
> another packet header filter.

Ok, but then you fall back to the linear approach. Since with u32 only
blocks of rules with the same prio can be optimized, one has to
implement a ruleset using as few differently prioritized blocks of
filters as possible to achieve maximum performance.

>>One disadvantage of this concept is that the hashed filters
>>must be compact, i.e. there cannot be other classifiers in between.
> 
> I didnt understand this. Are you talking about conflict resolution of
> overlapping filters?

No, the issue is just that within a block of filters with the same prio
there cannot be another type of filter, e.g. one cannot put a route
classifier inside a hash of u32 classifiers.

>>Another major disadvantage is caused by the hashing scheme.
>>If you use the hash for 1 dimension you have to make sure that
>>either all filters in a certain bucket are disjoint or you must have
>>an implicit ordering of the rules (according to the insertion order
>>or something). This scheme is not extendable to 2 or more dimensions,
>>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
>>on, because you simply cannot express arbitrary rulesets.
> 
> If i understood you - you are referring to a way to reduce the number of
> lookups by having disjoint hashes. I suppose for something as simple as
> a five tuple lookup, this is almost solvable by hardcoding the different
> fields into multiway hashes. Its when you try to generalize that it
> becomes an issue.

What do you mean exactly by "five tuple"? Do you refer to rules which
consist of 5 punctiform matches, i.e. no masks or ranges or wildcards
allowed, like (src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123,
dst port 456)?

Of course the scheme works for such cases (which consist of
non-conflicting rules) although the user must be aware of the
concrete hash function: divisor & u32_hash_fold(key, sel)
because the mask would be 0xffffffff for the ip's.

If ranges or overlapping masks are involved it gets really complicated
and we doubt that people are able to manage such scenarios.

>>Another general problem is of course that the user has to manually
>>set up the hash, which is rather inconvenient.
> 
> Yes. Take a look at Werner's tcng - he has a clever way to hide things
> from the user. I did experimentation on u32 with a kernel thread which
> rearranged things when they seemed out of balance, but i havent
> experimented with a lot of rules.

We had a look at the tcng paper. It says that the u32 classifier
is not used in the optimal way. Since we haven't had a look at the
current tcng release, it might well be that these problems are already
addressed. Is that the case?

BTW, why do you want to rearrange the tree of hashes, and based on which
heuristics? Why is a kernel thread needed? Isn't it possible to
rearrange the tree directly after insert/delete operations?

>>Now, what are the implications on the matching performance:
>>tc vs. nf-hipac? As long as the extended hashing stuff is not used
>>nf-hipac is clearly superior to tc. 
> 
> You are referring to u32. You mean as long as u32 stores things in a
> singly linked list, you win - correct?

Yes, but this is not only true for u32. As long as the ruleset
looks like: "n filters with n different priorities which can
be translated into n nf-hipac rules" nf-hipac is clearly faster
because in this case tc uses the linear approach.

>>When hashing is used it _really_
>>depends. If there is only one classifier (with hashing) per interface
>>and the number of rules per bucket is very small the performance should
>>be comparable. As soon as you add other classifiers nf-hipac will
>>outperform tc again.
> 
> If we take a simple user interface abstraction like tcng which hides the
> evil of u32 and then take simple 5 tuple rules - i doubt you will see
> any difference. For a more generic setup, the kernel thread i refer to
> would work - but it may slow insertion.

For the simple punctiform examples as described above you may be right
that nf-hipac and tc should perform similarly, but it's not clear to us
how you want to achieve universality (including masks, ranges and
wildcards) by this kernel thread rearranging approach. Basically you
have to address the following problem: Given an arbitrary set of u32
rules with different priorities, you have to compute a semantically
equivalent representation with a tree of hashes.

>>So, basically HIPAC is just a normal classifier like any other
>>with two exceptions:
>>   a) it can occur only once per interface
>>   b) the rules within the classifier can contain other classifiers,
>>      e.g. u32, fw, tc_index, as matches
> 
> But why restriction a)? 

Well, the restriction is necessary because of the new hipac design in
which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc) are
just applications for the classification framework. The basic idea is
to reduce the number of classifications on the forwarding path to a
single one (in the best case). In order to truly understand the
requirement it would be necessary to explain the idea behind the new
stage concept which is beyond the scope of this e-mail :-/.

> Also why should we need hipac to hold other filters when the
> infrastructure itself can hold the extended filters just fine?
> I think you may actually be trying to say why somewhere in the email,
> but it must not be making a significant impression on my brain.

The idea is to reduce the embedded classifiers to a match, i.e.
their return value is ignored. This offers the possibility of
expressing a conjunction of native matches and classifiers in the
very same way nf-hipac rules support iptables matches. This enhances
the expressiveness of classification rules.
A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m|
matches if nat. match 1-n and emb. cls 1-m match.
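
A minimal sketch of these semantics (all structure and function names
here are hypothetical, made up for illustration - not actual cls_hipac
code):

----
/* hypothetical: one hipac rule = native matches + embedded classifiers */
struct native_match;    /* src/dst ip, ports, fwmark, ...; details omitted */
static int native_match_ok(struct native_match *m, struct sk_buff *skb);

struct hipac_rule {
        int n_native, n_emb;
        struct native_match **native;
        struct tcf_proto    **emb;    /* embedded u32/fw/tcindex filters */
        struct tcf_result     res;    /* classid returned on a full match */
};

static int hipac_rule_match(struct hipac_rule *r, struct sk_buff *skb)
{
        struct tcf_result unused;
        int i;

        for (i = 0; i < r->n_native; i++)
                if (!native_match_ok(r->native[i], skb))
                        return 0;
        /* embedded classifiers are reduced to a match: their own
         * tcf_result is discarded, only "matched or not" counts */
        for (i = 0; i < r->n_emb; i++)
                if (r->emb[i]->classify(skb, r->emb[i], &unused) < 0)
                        return 0;
        return 1;       /* caller then uses r->res as the verdict */
}
----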

>>There is just one problem with the current tc framework. Once
>>a new filter is inserted into the chain it is not removed even
>>if the change function of the classifier returns < 0
>>(2.6.0-test1: net/sched/cls_api.c: line 280f).
>>This should be changed anyway, shouldn't it?
> 
> Are you referring to this piece of code?
> ----
>         err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
>         if (err == 0)
>                 tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
> 
> errout:
>         if (cl)
>                 cops->put(q, cl);
>         return err;
> ---

Yes.

> change() should not return < 0 if it has installed the filter, i think.
> Should the top level code be responsible for removing filters?

The top level code (net/sched/cls_api.c:tc_ctl_filter) is responsible
for creating new tcf_proto structs (if not yet existent) and enqueuing
the struct into the chain. Therefore it is also responsible for taking
the stuff out of the chain again if necessary. In case we have just
created a new tcf_proto and change fails, it would be better if the new
tcf_proto were removed afterwards, i.e.
                     write_lock(&qdisc_tree_lock);
                     spin_lock_bh(&dev->queue_lock);
                     *back = tp->next;
                     spin_unlock_bh(&dev->queue_lock);
                     write_unlock(&qdisc_tree_lock);
                     tp->ops->destroy(tp);
                     module_put(tp->ops->owner);
                     kfree(tp);
is issued.
Do you agree?

> Consider what i said above. I'll try n cobble together some examples to
> demonstrate (although it seems you already know this).
> To allow for anyone to install classifiers-du-jour without being
> dependent on hipac would be very useful. So ideas that you have for
> enabling this cleanly should be moved to cls_api.

Nobody will be forced to use hipac :-). It's just another classifier
like u32. We didn't even have to modify cls_api so far. Everything
integrates just fine.


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+




* Re: [RFC] High Performance Packet Classification for tc framework
  2003-08-04 13:17       ` Michael Bellion and Thomas Heinz
@ 2003-08-04 15:51         ` jamal
  2003-08-05 22:21           ` Michael Bellion and Thomas Heinz
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-08-04 15:51 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: linux-net, netdev

Olla,

On Mon, 2003-08-04 at 09:17, Michael Bellion and Thomas Heinz wrote:

> > I think i will have to look at your code to make comments.
> 
> We're curious to hear them.
> 

I promise i will. I dont think i will do it justice spending 5 minutes
on it. I take it you have written extensive docs too ;->

> > Not entirely accurate. Depends which tc classifier. u32 hash tables are
> > in fact like iptables chains.
> 
> Hm, we don't think so. Unfortunately, there does not seem to be much
> information about the inner workings of u32 and we currently don't have
> the time to deduce the whole algorithm from the code.
> 

Unfortunately it is more exciting to write code than documents. I almost
got someone to document at least its proper usage but they backed away
at the last minute.

> Here is a short overview of our current view on u32:
>       - each u32 filter "rule" consists of possibly several u32 matches,
>         i.e. tc_u32_sel with nkeys tc_u32_key's
>         => one rule is basically represented as a struct tc_u_knode
>       - a set of u32 filter rules with same priority is in general a
>         tree of hashes like for example:
>                         hash1: |--|--|
>                                 /   \
>                 hash2: |--|--|--|    hash3: |--|--|--|--|
>                          |  |  |              |  |  |  |
>                         r1 r2 r3             r4 r5 r6 r7
>         where the r_i are in fact lists of rules (-> hashing with
>         chaining)
>         => if there is more than one single filter with same prio
>            there is always a tree of hashes (with possibly only 1 node
>            (=hash))
>       - within such a tree of u32 filters (with same prio) there is
>         no concept of prioritizing them any further => the rules must
>         be conflict free
>       - there is no way of optimizing filters with different priorities
>         since one cannot assume that the intermediate classifiers are all
>         of the same type
> 
> If this is not the way it _really_ works we'd appreciate it if you could
> describe the general principles behind u32.
> 

u32 is a swiss army knife, so going into general principles requires
some time, motivation, and more importantly patience. I possess none of
these nice attributes at the moment. You are doing a good job - keep
reading the code.
I dont wanna go into a lot of detail, but one important detail is that
keynodes can also lead to other hash tables. So you can split the packet
parsing across multiple hashes - this is where the comparison with
chains comes in. There are several ways to do this. I'll show you the
brute force way; you can make it more usable with the "hashkey" and
"sample" operators. Stealing from your example:

                        hash1: |--|--|
                                /   
                hash2: |--|--|--|   
                         |  |  |             
                        r1 r2 r3 
                         |     |
                        hash3  hash4
                                |  | 
                                r4 r5  
 
For example, you go into hash2 for all IP packets. The rules in hash2
look at the protocol type and select a different hash table for TCP,
UDP, ICMP etc.

- so the general rule is: put your most-hit rules at the highest
priority so they are found first.
Here's an example; i havent tested it (i can send you a tested example
if you cant get this to work):

-------
TCF="tc filter add dev eth0 parent ffff: protocol ip prio 10"

# add hash table 1
$TCF handle 1: u32 divisor 1
# add hash table 2
$TCF handle 2: u32 divisor 1
# add your filter rules to specific tables: ICMP to table 1, TCP to
# table 6, etc.
.
.
# ICMP (protocol 1) gets matched in table 1
$TCF u32 match ip protocol 1 0xff link 1:
.
.
----------

Makes sense?

Note, this doesnt say much about the usability of u32 - it just says it
can be done.

> > Note, the concept of priorities which is used for conflict resolution as
> > well as further separating sets of rules doesnt exist in iptables.
> 
> Well, iptables rule position and tc filter priorities are just the
> same apart from the fact that iptables does not allow multiple rules
> to have the same priority (read: position). Therefore iptables rulesets
> don't suffer from conflicts.
> 

Sure, position could be used as a priority. It is easier/more intuitive
to just have explicit priorities.

> > You can also have them use different priorities and, with the continue
> > operator, first classify based on packet data, then on metadata or on
> > another packet header filter.
> 
> Ok, but then you fall back to the linear approach. Since with u32 only
> blocks of rules with the same prio can be optimized, one has to
> implement a ruleset using as few differently prioritized blocks of
> filters as possible to achieve maximum performance.
> 

Read what i said above; if you still hold the same opinion, lets
discuss. What "optimizes" could be a user interface or the thread i was
talking about earlier.

> >>One disadvantage of this concept is that the hashed filters
> >>must be compact, i.e. there cannot be other classifiers in between.
> > 
> > I didnt understand this. Are you talking about conflict resolution of
> > overlapping filters?
> 
> No, the issue is just that within a block of filters with the same prio
> there cannot be another type of filter, e.g. one cannot put a route
> classifier inside a hash of u32 classifiers.
> 

But you dont need to, as i was pointing out earlier. You can have
fwmark, tcindex, u32, rsvp etc being invoked one after the other.

> >>Another major disadvantage is caused by the hashing scheme.
> >>If you use the hash for 1 dimension you have to make sure that
> >>either all filters in a certain bucket are disjoint or you must have
> >>an implicit ordering of the rules (according to the insertion order
> >>or something). This scheme is not extendable to 2 or more dimensions,
> >>i.e. 1 hash for src ip, #(src ip buckets) many dst ip hashes and so
> >>on, because you simply cannot express arbitrary rulesets.
> > 
> > If i understood you - you are referring to a way to reduce the number of
> > lookups by having disjoint hashes. I suppose for something as simple as
> > a five tuple lookup, this is almost solvable by hardcoding the different
> > fields into multiway hashes. Its when you try to generalize that it
> > becomes an issue.
> 
> What do you mean exactly by "five tuple"? Do you refer to rules which
> consist of 5 punctiform matches, i.e. no masks or ranges or wildcards
> allowed, like (src ip 1.2.3.4, dst ip 3.4.5.6, proto tcp, src port 123,
> dst port 456)?
> 

The above, but with masks. "5 tuple" is the classical name for it.

> Of course the scheme works for such cases (which consist of
> non-conflicting rules) although the user must be aware of the
> concrete hash function: divisor & u32_hash_fold(key, sel)
> because the mask would be 0xffffffff for the ip's.
> 
> If ranges or overlapping masks are involved it gets really complicated
> and we doubt that people are able to manage such scenarios.
> 

I was referring to the cascaded hash tables i mentioned earlier.
Depending on the rules, you could optimize differently.

> >>Another general problem is of course that the user has to manually
> >>set up the hash, which is rather inconvenient.
> > 
> > Yes. Take a look at Werner's tcng - he has a clever way to hide things
> > from the user. I did experimentation on u32 with a kernel thread which
> > rearranged things when they seemed out of balance, but i havent
> > experimented with a lot of rules.
> 
> We had a look at the tcng paper. It says that the u32 classifier
> is not used in the optimal way. Since we haven't had a look at the
> current tcng release, it might well be that these problems are already
> addressed. Is that the case?
> 

He doesnt fix u32; rather, if you use his wrappers, he outputs
optimized u32 rules. All that is hidden from the user.

> BTW, why do you want to rearrange the tree of hashes, and based on which
> heuristics? Why is a kernel thread needed? Isn't it possible to
> rearrange the tree directly after insert/delete operations?
> 

You can do that, but then you are adding delay to the insertion/deletion
rates which are very important metrics.
Another way to do it is to fire a netlink message every time a hash
table's keynodes exceed a threshold value and have user space compute a
rearrangement.
Essentially you have to weigh your tradeoffs. 

> >>Now, what are the implications on the matching performance:
> >>tc vs. nf-hipac? As long as the extended hashing stuff is not used
> >>nf-hipac is clearly superior to tc. 
> > 
> > You are referring to u32. You mean as long as u32 stores things in a
> > singly linked list, you win - correct?
> 
> Yes, but this is not only true for u32. As long as the ruleset
> looks like: "n filters with n different priorities which can
> be translated into n nf-hipac rules" nf-hipac is clearly faster
> because in this case tc uses the linear approach.
> 

If you still hold this opinion after my explanation on cascaded hash
tables, then lets discuss again.

> >>When hashing is used it _really_
> >>depends. If there is only one classifier (with hashing) per interface
> >>and the number of rules per bucket is very small the performance should
> >>be comparable. As soon as you add other classifiers nf-hipac will
> >>outperform tc again.
> > 
> > If we take a simple user interface abstraction like tcng which hides the
> > evil of u32 and then take simple 5 tuple rules - i doubt you will see
> > any difference. For a more generic setup, the kernel thread i refer to
> > would work - but it may slow insertion.
> 
> For the simple punctiform examples as described above you may be right
> that nf-hipac and tc should perform similarly, but it's not clear to us
> how you want to achieve universality (including masks, ranges and
> wildcards) by this kernel thread rearranging approach. Basically you
> have to address the following problem: Given an arbitrary set of u32
> rules with different priorities, you have to compute a semantically
> equivalent representation with a tree of hashes.
> 

yes - that is the challenge to resolve;->

> >>So, basically HIPAC is just a normal classifier like any other
> >>with two exceptions:
> >>   a) it can occur only once per interface
> >>   b) the rules within the classifier can contain other classifiers,
> >>      e.g. u32, fw, tc_index, as matches
> > 
> > But why restriction a)? 
> 
> Well, the restriction is necessary because of the new hipac design in
> which nf-hipac (i.e. firewalling), routing and cls_hipac (i.e. tc) are
> just applications for the classification framework. The basic idea is
> to reduce the number of classifications on the forwarding path to a
> single one (in the best case). In order to truly understand the
> requirement it would be necessary to explain the idea behind the new
> stage concept which is beyond the scope of this e-mail :-/.
> 

Ok - maybe when you explain the concept later i will get it. Is your
plan to put this in places other than Linux?

> > Also why should we need hipac to hold other filters when the
> > infrastructure itself can hold the extended filters just fine?
> > I think you may actually be trying to say why somewhere in the email,
> > but it must not be making a significant impression on my brain.
> 
> The idea is to reduce the embedded classifiers to a match, i.e.
> their return value is ignored. This offers the possibility of
> expressing a conjunction of native matches and classifiers in the
> very same way nf-hipac rules support iptables matches. This enhances
> the expressiveness of classification rules.
> A rule |nat. match 1|...|nat. match n|emb. cls 1|...|emb. cls m|
> matches if nat. match 1-n and emb. cls 1-m match.
> 

So you got this thought from iptables and took it to the next level?
I am still not sure i understand why not use what already exists - but
i'll just say i dont see it right now.


> 
> The top level code (net/sched/cls_api.c:tc_ctl_filter) is responsible
> for creating new tcf_proto structs (if not yet existent) and enqueuing
> the struct into the chain. Therefore it is also responsible for taking
> the stuff out of the chain again if necessary. In case we have just
> created a new tcf_proto and change fails, it would be better if the new
> tcf_proto were removed afterwards, i.e.
>                      write_lock(&qdisc_tree_lock);
>                      spin_lock_bh(&dev->queue_lock);
>                      *back = tp->next;
>                      spin_unlock_bh(&dev->queue_lock);
>                      write_unlock(&qdisc_tree_lock);
>                      tp->ops->destroy(tp);
>                      module_put(tp->ops->owner);
>                      kfree(tp);
> is issued.
> Do you agree?
> 

It doesnt appear harmful to leave it there without destroying it.
The next time someone adds a filter of the same protocol + priority, it
will already exist. If you want to be accurate (because it does get
destroyed when the init() fails), then destroy it, but you need to put
checks for "in case we have added a new tcf_proto" which may not look
pretty. Is this causing you some discomfort?

> > Consider what i said above. I'll try n cobble together some examples to
> > demonstrate (although it seems you already know this).
> > To allow for anyone to install classifiers-du-jour without being
> > dependent on hipac would be very useful. So ideas that you have for
> > enabling this cleanly should be moved to cls_api.
> 
> Nobody will be forced to use hipac :-). It's just another classifier
> like u32. We didn't even have to modify cls_api so far. Everything
> integrates just fine.
> 

cool. Keep up the good work.

cheers,
jamal


* Re: [RFC] High Performance Packet Classification for tc framework
  2003-08-04 15:51         ` jamal
@ 2003-08-05 22:21           ` Michael Bellion and Thomas Heinz
  2003-08-07 19:58             ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-05 22:21 UTC (permalink / raw)
  To: hadi; +Cc: linux-net, netdev


Hi Jamal

You wrote:
> I promise i will. I dont think i will do it justice spending 5 minutes
> on it. I take it you have written extensive docs too ;->

Of course ;-)
Well, actually we are going to present an overview of the hipac
algorithm at the netfilter developer workshop in Budapest.
Hope to see you there.

> Unfortunately it is more exciting to write code than documents. I almost
> got someone to document at least its proper usage but they backed away
> at the last minute.

lol

> I dont wanna go into a lot of detail, but one important detail is that
> keynodes can also lead to other hash tables. So you can split the packet
> parsing across multiple hashes - this is where the comparison with
> chains comes in. There are several ways to do this. I'll show you the
> brute force way; you can make it more usable with the "hashkey" and
> "sample" operators. Stealing from your example:
> 
> [example snipped]
> 
> Makes sense?

Yes, it does. Still the question is how to solve this
generally. Consider the following example ruleset:

1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20
2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22
3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24
4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26
5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28
6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30

So you have 1 src ip hash and #buckets(src ip hash) many
dst ip hashes. In order to achieve maximum performance
you have to minimize the number of collisions in the
hash buckets. How would you choose the hash function
and what would the construction look like?

In principle the tree of hashes approach is capable of expressing
a general access-list-like ruleset, i.e. a set of terminal rules
with different priorities. The problem is that the approach is only
efficient if the number of collisions is O(1) -> no amortized analysis
but rather per bucket.

In theory you can do the following. Let's consider one
dimension. The matches in one dimension form a set of
elementary intervals which are overlapped by certain rules.
Example:

          |------|                      |---------|
      |----------------|
             |------------------|
                           |---------------|

|----|---|--|---|-----|---|----|-------|--|------|-------|

The '|-----|' reflect the matches and the bottom line
represents the set of elementary intervals introduced
by the matches. Now, you can decide for each elementary
interval which rule matches since the rules are terminal
and uniquely prioritized.

The next step would be to create a hash with #elementary
intervals many buckets and create a hash function which
maps the keys to the appropriate buckets like in the picture.
In this case you have exactly 1 entry per hash bucket.
Sounds fine BUT it is not possible to generically deduce
an easily (= fast) computable hash function with the
described requirements.

BTW, this approach can be extended to 2 or more dimensions
where the hash function for each hash has to meet the
requirement. Of course this information is not very helpful
since the problem of defining appropriate hash functions
remains ;)

Obviously this way is not viable but supposedly the only
one to achieve ultimate performance with the tree of hashes
concept.

BTW, the way hipac works is basically not so different
from the idea described above but since we use efficient
btrees we don't have to define hash functions.
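
To illustrate the one-dimensional case: the following small program
(purely illustrative, not hipac code) decomposes a set of prioritized
ranges into elementary intervals at insert time and then classifies a
key with a binary search over the interval boundaries - which is
essentially the role the btrees play:

----
#include <stdio.h>
#include <stdlib.h>

/* a terminal rule in one dimension: matches keys in [lo,hi];
   a smaller prio value means higher priority */
struct rule { unsigned lo, hi; int prio; };

static struct rule rules[] = {
        { 10, 30, 1 },
        {  5, 50, 2 },
        { 25, 80, 3 },
};
#define NRULES (sizeof(rules) / sizeof(rules[0]))

static int cmp_uint(const void *a, const void *b)
{
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return x < y ? -1 : x > y;
}

int main(void)
{
        unsigned bound[2 * NRULES];
        int winner[2 * NRULES];
        unsigned key = 27;
        unsigned i, j, n = 0;
        int lo, hi, match = -1;

        /* 1. every rule's lo and hi+1 starts a new elementary interval */
        for (i = 0; i < NRULES; i++) {
                bound[n++] = rules[i].lo;
                bound[n++] = rules[i].hi + 1;
        }
        qsort(bound, n, sizeof(bound[0]), cmp_uint);

        /* 2. per elementary interval [bound[i], bound[i+1]) decide the
           winning rule once - the expensive part, done at insert time */
        for (i = 0; i + 1 < n; i++) {
                winner[i] = -1;
                for (j = 0; j < NRULES; j++)
                        if (rules[j].lo <= bound[i] && bound[i] <= rules[j].hi
                            && (winner[i] < 0
                                || rules[j].prio < rules[winner[i]].prio))
                                winner[i] = j;
        }

        /* 3. lookup = binary search for the interval containing the key:
           O(log #intervals), no hash function required */
        lo = 0;
        hi = (int)n - 2;
        while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (key < bound[mid])
                        hi = mid - 1;
                else if (key >= bound[mid + 1])
                        lo = mid + 1;
                else {
                        match = winner[mid];
                        break;
                }
        }
        if (match >= 0)
                printf("key %u -> rule [%u,%u] prio %d\n", key,
                       rules[match].lo, rules[match].hi, rules[match].prio);
        else
                printf("key %u -> no match\n", key);
        return 0;
}
----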

> Sure, position could be used as a priority. It is easier/more intuitive
> to just have explicit priorities.

Merely a matter of taste. The way iptables and nf-hipac use
priorities is somewhat more dynamic than the tc way because
they are automatically adjusted if a rule is inserted in between
others.

> What "optimizes" could be a user interface or the thread i was talking
> about earlier.

Hm, this rebalancing is not clear to us. Do you want to rebalance
the tree of hashes? This seems a little strange at first glance
because the performance of the tree of hashes is dominated by the
number of collisions that need to be resolved and not the depth of
the tree.

> Is your plan to put this in places other than Linux?

Currently we are working on the integration into Linux.
In general the hipac core is OS- and application-independent,
so basically it could also be used in a userspace program
related to classification and of course in other OSes.

Any special reason why you are asking this question?

> So you got this thought from iptables and took it to the next level?

Well, in order to support iptables matches and targets we had
to create an appropriate abstraction for them on the hipac
layer. This abstraction can also be used for tc classifiers
if the tcf_result is ignored, i.e. you just consider whether
the filter matched or not.

> I am still not sure i understand why not use what already exists - but
> i'll just say i dont see it right now.

If hipac had no support for embedded classifiers you couldn't
express a ruleset like:
1) [native hipac matches] [u32 filter] [classid]
2) [native hipac matches] [classid]
You would have to construct rule 1) in a way that it "jumps"
to an external u32 filter. Unfortunately, you cannot jump
back to the hipac filter again in case the u32 filter does
not match so rule 2) is unreachable. This problem is caused
by the fact that cls_hipac can occur at most once per interface.

> It doesnt appear harmful to leave it there without destroying it.
> The next time someone adds a filter of the same protocol + priority, it
> will already exist. If you want to be accurate (because it does get
> destroyed when the init() fails), then destroy it, but you need to put
> checks for "in case we have added a new tcf_proto" which may not look
> pretty. Is this causing you some discomfort?

No, actually not.


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+



* Re: [RFC] High Performance Packet Classification for tc framework
  2003-08-05 22:21           ` Michael Bellion and Thomas Heinz
@ 2003-08-07 19:58             ` jamal
  2003-08-07 20:05               ` David S. Miller
  2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
  0 siblings, 2 replies; 24+ messages in thread
From: jamal @ 2003-08-07 19:58 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: linux-net, netdev

Hi there,

On Tue, 2003-08-05 at 18:21, Michael Bellion and Thomas Heinz wrote:
> Hi Jamal
> 
> You wrote:
> > I promise i will. I dont think i will do it justice spending 5 minutes
> > on it. I take it you have written extensive docs too ;->
> 
> Of course ;-)
> Well, actually we are going to present an overview of the hipac
> algorithm at the netfilter developer workshop in Budapest.
> Hope to see you there.
> 

Unfortunately, due to economic reasons i wont be able to make it. I
mentioned it to LaForge.

> > Unfortunately it is more exciting to write code than documents. I almost
> > got someone to document at least its proper usage but they backed away
> > at the last minute.
> 
> lol
> 

It was very close ;-> The guy looked motivated; i felt scared for a
while that he would be asking a lot of questions. Then i never heard
about it again ;-> I think he left town too.

> Yes, it does. Still the question is how to solve this
> generally. Consider the following example ruleset:
> 
> 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20
> 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22
> 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24
> 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26
> 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28
> 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30
> 
> So you have 1 src ip hash and #buckets(src ip hash) many
> dst ip hashes. In order to achieve maximum performance
> you have to minimize the number of collisions in the
> hash buckets. How would you choose the hash function
> and what would the construction look like?
> 

It can be done by using the masks - but it would look really ugly. I
suppose just providing a good user interface is valuable.

> In principle the tree of hashes approach is capable of
> expressing a general access-list-like ruleset, i.e. a set
> of terminal rules with different priorities. The problem
> is that the approach is only efficient if the number of
> collisions is O(1) -> no amortized analysis but rather
> per bucket.
> 

true.

> In theory you can do the following. Let's consider one
> dimension. The matches in one dimension form a set of
> elementary intervals which are overlapped by certain rules.
> Example:
> 
>           |------|                      |---------|
>       |----------------|
>              |------------------|
>                            |---------------|
> 
> |----|---|--|---|-----|---|----|-------|--|------|-------|
> 
> The '|-----|' reflect the matches and the bottom line
> represents the set of elementary intervals introduced
> by the matches. Now, you can decide for each elementary
> interval which rule matches since the rules are terminal
> and uniquely prioritized.
> 

right. Why do you refer to this as one dimension?

> The next step would be to create a hash with #elementary
> intervals many buckets and create a hash function which
> maps the keys to the appropriate buckets like in the picture.
> In this case you have exactly 1 entry per hash bucket.
> Sounds fine BUT it is not possible to generically deduce
> an easily (= fast) computable hash function with the
> described requirements.
> 

nod.

> BTW, this approach can be extended to 2 or more dimensions
> where the hash function for each hash has to meet the
> requirement. Of course this information is not very helpful
> since the problem of defining appropriate hash functions
> remains ;)
> 

Is that problem even solvable?

> Obviously this way is not viable but supposedly the only
> one to achieve ultimate performance with the tree of hashes
> concept.
> 
> BTW, the way hipac works is basically not so different
> from the idea described above but since we use efficient
> btrees we don't have to define hash functions.
> 

This is why i was wondering how fast your insertions and deletions are.
Seems to me you will have to sort the rules every time.

> > sure position could be used as a priority. It is easier/intuitive to
> > just have explicit priorities.
> 
> Merely a matter of taste. The way iptables and nf-hipac use
> priorities is somewhat more dynamic than the tc way because
> they are automatically adjusted if a rule is inserted in between
> others.
> 

Dont you think this "dynamic adjustment" is unnecessary? Essentially you
enforce a model where every rule has a different priority.

> > What "optimizes" could be a user interface or the thread i was talking
> > about earlier.
> 
> Hm, this rebalancing is not clear to us. Do you want to rebalance
> the tree of hashes? This seems a little strange at first
> glance because the performance of the tree of hashes is dominated
> by the number of collisions that need to be resolved and
> not the depth of the tree.
> 

The general idea is to recreate the tree if need be based on collisions.
I just hope some idiot reading this doesnt go and patent it (has happened
before). Think of it as dynamic hash adjustment. Talk to me in private
if you are really interested.

> Is your plan to put this in places other than Linux?
> 
> Currently we are working on the integration into Linux.
> In general the hipac core is OS- and application-independent,
> so it could also be used in a userspace program
> related to classification, and of course in other OSes.
> 
> Any special reason why you are asking this question?
> 

I was wondering why not just optimize it for Linux. You are trying to
center the world around nf-hipac - I would just center it around 
Linux ;->

> > So you got this thought from iptables and took it to the next level?
> 
> Well, in order to support iptables matches and targets we had
> to create an appropriate abstraction for them on the hipac
> layer. This abstraction can also be used for tc classifiers
> if the tcf_result is ignored, i.e. you just consider whether
> the filter matched or not.
> 

I am not sure i understood the part about ignoring tcf_result.

> > I am still not sure i understand why not use what already exists - but
> > i'll just say i dont see it right now.
> 
> If hipac had no support for embedded classifiers you couldn't
> express a ruleset like:
> 1) [native hipac matches] [u32 filter] [classid]
> 2) [native hipac matches] [classid]
> You would have to construct rule 1) in a way that it "jumps"
> to an external u32 filter. Unfortunately, you cannot jump
> back to the hipac filter again in case the u32 filter does
> not match so rule 2) is unreachable. This problem is caused
> by the fact that cls_hipac can occur at most once per interface.
> 

You show only one classid per rule .. I think i can see what you meant
by ignoring tcf_result - essentially you want to have a series of filter
rules with different classification engines, no? But could you not have
the filters repeat the same classid for every filter? 
Also it seems you want to be able to have an action defined for
"nomatch" as well as "match" - is that correct? Some form of
reclassification when nomatch ?

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-07 19:58             ` jamal
@ 2003-08-07 20:05               ` David S. Miller
  2003-08-08 21:51                 ` jamal
  2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
  1 sibling, 1 reply; 24+ messages in thread
From: David S. Miller @ 2003-08-07 20:05 UTC (permalink / raw)
  To: hadi; +Cc: nf, linux-net, netdev

On 07 Aug 2003 15:58:51 -0400
jamal <hadi@cyberus.ca> wrote:

> > Yes, it does. Still the question is how to solve this
> > generally. Consider the following example ruleset:
> > 
> > 1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20
> > 2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22
> > 3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24
> > 4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26
> > 5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28
> > 6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30
> > 
> > So you have 1 src ip hash and #buckets(src ip hash) many
> > dst ip hashes. In order to achieve maximum performance
> > you have to minimize the number of collisions in the
> > hash buckets. How would you choose the hash function
> > and what would the construction look like?
> > 
> 
> It can be done by using the masks - but it would look really ugly. I
> suppose just providing a good user interface is valuable.

If you input all the keys into the Jenkins hash, how does
it perform?  Has anyone even tried that and compared it
to all of these fancy multi-level tree like hash things?

I think Jenkins would work very well for exactly this kind
of application.  And it's fully available to the entire kernel
via linux/jhash.h and already in use by other things such
as the routing cache and the netfilter conntrack code.
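
For illustration, keying a fixed 5-tuple with jhash could look roughly
like this (a sketch only; the function name and key layout are made
up, and nbuckets is assumed to be a power of 2):

    #include <linux/types.h>
    #include <linux/jhash.h>

    static u32 flow_bucket(u32 saddr, u32 daddr, u16 sport, u16 dport,
                           u8 proto, u32 nbuckets)
    {
            /* fold the 5-tuple into three 32-bit words for jhash */
            u32 h = jhash_3words(saddr, daddr,
                                 ((u32)sport << 16) | dport, proto);
            return h & (nbuckets - 1);
    }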

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-07 20:05               ` David S. Miller
@ 2003-08-08 21:51                 ` jamal
  2003-08-09  0:01                   ` David S. Miller
  0 siblings, 1 reply; 24+ messages in thread
From: jamal @ 2003-08-08 21:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: nf, linux-net, netdev

On Thu, 2003-08-07 at 16:05, David S. Miller wrote: 
> On 07 Aug 2003 15:58:51 -0400
> jamal <hadi@cyberus.ca> wrote:

> If you input all the keys into the Jenkins hash, how does
> it perform?  Has anyone even tried that and compared it
> to all of these fancy multi-level tree like hash things?

AFAIK, no one has tried it. I will try it out at some point. 

> I think Jenkins would work very well for exactly this kind
> of application.  And it's fully available to the entire kernel
> via linux/jhash.h and already in use by other things such
> as the routing cache and the netfilter conntrack code.

A good reason for the multilevel stuff is to support arbitrary packet
offsets, i.e. you dont know which bits in the packet you are interested in
ahead of time. Its easy to use hashes when you know that you need to
find, for example, ip src/dst. Since its in the kernel I will look into
it - but it has to meet the arbitrary offset requirement.
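
To make the arbitrary offset idea concrete, a u32-style key roughly
amounts to the following (standalone sketch; the struct and field
names are made up):

    #include <stdint.h>
    #include <string.h>

    /* compare a masked 32-bit word at an arbitrary byte offset */
    struct u32_key {
            int      off;    /* byte offset into the packet */
            uint32_t mask;   /* which bits matter */
            uint32_t val;    /* expected value of those bits */
    };

    static int key_matches(const struct u32_key *k,
                           const unsigned char *pkt, int len)
    {
            uint32_t w;

            if (k->off < 0 || k->off + 4 > len)
                    return 0;
            memcpy(&w, pkt + k->off, 4);   /* unaligned-safe load */
            return (w & k->mask) == k->val;
    }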

cheers,
jamal



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-08 21:51                 ` jamal
@ 2003-08-09  0:01                   ` David S. Miller
  0 siblings, 0 replies; 24+ messages in thread
From: David S. Miller @ 2003-08-09  0:01 UTC (permalink / raw)
  To: hadi; +Cc: nf, linux-net, netdev

On 08 Aug 2003 17:51:40 -0400
jamal <hadi@cyberus.ca> wrote:

> Its easy to use hashes when you know that you need to
> find, for example, ip src/dst.

Jenkins is rather agnostic about the input bits, that's
what makes it so powerful.  It performs about as well
for random input as it does for input which has various
patterns.

Wait, are you saying the input key size can change?  Yes,
that's an interesting problem.

But for things where you always want some 96-bit key,
Jenkins is probably best.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-07 19:58             ` jamal
  2003-08-07 20:05               ` David S. Miller
@ 2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
  2003-08-12  2:57                 ` jamal
  2003-08-12  5:40                 ` David S. Miller
  1 sibling, 2 replies; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-11 22:39 UTC (permalink / raw)
  To: hadi; +Cc: linux-net, netdev

[-- Attachment #1: Type: text/plain, Size: 9529 bytes --]

Hi Jamal

Sorry for the late reply.

You wrote:
> Unfortunately due to economical reasons i wont be able to make it. I
> mentioned it to LaForge.

It's a pity. Anyway, we should keep in touch to discuss the
future directions of hipac, tc and stuff. We are going to
post the new hipac design as soon as it is sufficiently
elaborated.

> It was very close ;-> The guy looked motivated; i felt scared for a while
> that he would be asking a lot of questions. Then i never heard about it
> again ;-> I think he left town too. 

Too bad, you should have told the guy about all the fame
and glory ;-)

>>Yes, it does. Still the question is how to solve this
>>generally. Consider the following example ruleset:
>>
>>1) src ip 10.0.0.0/30 dst ip 20.0.0.0/20
>>2) src ip 10.0.0.0/28 dst ip 20.0.0.0/22
>>3) src ip 10.0.0.0/26 dst ip 20.0.0.0/24
>>4) src ip 10.0.0.0/24 dst ip 20.0.0.0/26
>>5) src ip 10.0.0.0/22 dst ip 20.0.0.0/28
>>6) src ip 10.0.0.0/20 dst ip 20.0.0.0/30
> 
> It can be done by using the masks - but it would look really ugly. I
> suppose just providing a good user interface is valuable.

Well, let's discuss the problem in a more general way. First of
all we should clearly agree on the goal. For simplicity we focus
on the 1-dimensional case (e.g. src ip; see the explanation of the
term "dimension" further below) and we
allow only prefixes (no general masks).

Problem: We are given an access list of n 1-dimensional prefix rules,
          i.e. (v_1/p_1, ..., v_n/p_n) where v_i is the value
          (e.g. src ip) and p_i is the length of the prefix.
          [The target of the rules can be ignored since it is
           irrelevant for the considerations.]
          The priority of each rule is determined by its position
          in the tuple (-> smaller value means higher priority).
          Furthermore we assume: p_i >= p_j for all i <= j to avoid
          rules that can never match.
          Now, we want to construct a hash with m = O(n) buckets
          where each bucket contains O(1) rules. Therefore we need
          to find a prefix length PREF and a hash function HASH
          where given a key k the number of the bucket is computed
          via: HASH(k & ~(~0 >> PREF))
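
In C, for 32-bit keys, the bucket computation reads roughly as follows
(a sketch; HASH is left abstract since finding it is exactly the open
problem):

    #include <stdint.h>

    unsigned HASH(uint32_t masked_key);   /* the sought-after function */

    /* bucket index for key k: hash only the top PREF bits */
    static unsigned bucket_of(uint32_t k, int PREF)
    {
            /* shifting a 32-bit value by 32 is undefined in C,
               so PREF == 32 needs a special case */
            uint32_t mask = (PREF < 32) ? ~(~(uint32_t)0 >> PREF)
                                        : ~(uint32_t)0;

            return HASH(k & mask);
    }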

Let's take a look at the elementary intervals (see our last e-mail)
created by the ruleset. There are at most 2 * n + 1 elementary
intervals and each interval can be described in the form of a
prefix: eval/epref where eval is the value and epref the length of
the prefix.
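
To make this concrete, a small standalone C sketch that enumerates the
elementary intervals of a ruleset by sorting interval boundaries (it
illustrates the counting argument only; deriving the eval/epref form
is omitted):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* one prefix rule v/p covering the key range [v, v + 2^(32-p) - 1] */
    struct rule { uint32_t v; int p; };

    static uint64_t width(struct rule r) { return 1ULL << (32 - r.p); }

    static int cmp_u64(const void *a, const void *b)
    {
            uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
            return (x > y) - (x < y);
    }

    int main(void)
    {
            struct rule rs[] = { { 0x0a000000u, 30 }, { 0x0a000000u, 28 } };
            int n = 2, m = 0, i;
            uint64_t b[2 * 2 + 2];   /* 2 boundaries per rule + 0 and 2^32 */

            b[m++] = 0;
            b[m++] = 1ULL << 32;
            for (i = 0; i < n; i++) {
                    b[m++] = rs[i].v;
                    b[m++] = (uint64_t)rs[i].v + width(rs[i]);
            }
            qsort(b, m, sizeof b[0], cmp_u64);

            /* consecutive distinct boundaries delimit the elementary
               intervals: at most 2 * n + 1 of them */
            for (i = 0; i + 1 < m; i++)
                    if (b[i] != b[i + 1])
                            printf("[%llu, %llu)\n",
                                   (unsigned long long)b[i],
                                   (unsigned long long)b[i + 1]);
            return 0;
    }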

From the description above we know that p_1 =: p_max is the maximal
prefix length and p_n =: p_min is the minimal prefix length of the
ruleset. We can conclude that epref_min >= p_min and epref_max = p_max
where epref_min is the minimal prefix length of all elementary
intervals (analogously for epref_max).

Now to the main question: How to choose PREF? Obviously, we would
choose it such that epref_min <= PREF <= epref_max.
Well, here we are in trouble because if you choose PREF < epref_max
the number of elements per bucket is in the worst case:
  Sum[i := 1 to epref_max - PREF] min(2^i, count(PREF + i))
  where count(j) := number of rules with prefix length j

So, we have to choose PREF very close to epref_max or even epref_max
if we want to have O(1) rules per bucket. Unfortunately, if we're
doing this we run into another problem. Assume PREF = epref_max.
Since we allow O(n) buckets we can assign all rules with prefix length
PREF to separate buckets -> no collisions. But what about
the remaining rules with prefix lengths < PREF? Those prefixes
are distributed over multiple buckets which breaks the O(1)
requirement for buckets. The reason is as follows.
We know that each of the remaining n - count(PREF) rules terminates
in at least 1 elementary interval. The elementary interval of a
rule with prefix length p contains at most 2^(w - p) keys, where w is
the bit width of the keys. It is distributed over 2^(PREF - p) buckets
(at most).
Therefore we have at most:
      N := Sum[i := p_min to PREF - 1] count(i) * 2^(PREF - i)
rules in all buckets, that is N / O(n) (> O(1)) rules per bucket.

This estimation is rather rough but it clearly shows that choosing
PREF near epref_max breaks the O(1) requirement for buckets.

This is a fundamental issue related to the use of hashes in this
context. It shows that it is _impossible_ to create a hash which
meets the requirement of O(1) rules per bucket in the setting
described above, regardless of how cleverly you choose the hash function.

What do you think about it? Do you agree?

>>In theory you can do the following. Let's consider one
>>dimension. The matches in one dimension form a set of
>>elementary intervals which are overlapped by certain rules.
>>
>> [snipped]
> 
> right. Why do you refer to this as one dimension?

The term 'dimension' is related to the PCP (packet
classification problem):
Given a d-dimensional universe {i | 0 <= i < U}^d,
       a set of n rules {r_1, ..., r_n} where
         r_i = (cost, ([a_1, b_1], ..., [a_d, b_d]))
         with 0 <= a_i <= b_i < U
   and a packet (p_1, ..., p_d) where 0 <= p_i < U
one has to find the least cost rule matching the packet.
[The cost values must be unique per ruleset]

Translated to real world packet classification a dimension
is for example src ip, dst ip, proto, src port, dst port etc.
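
Transcribed into C, the definition reads like this (a reference linear
scan just to pin down the semantics, not an efficient classifier; all
names are ours):

    #include <stddef.h>
    #include <stdint.h>

    #define D 5   /* e.g. src ip, dst ip, proto, src port, dst port */

    struct pcp_rule {
            unsigned cost;          /* unique per ruleset; least cost wins */
            uint32_t a[D], b[D];    /* one closed range [a_i, b_i] per dim */
    };

    /* least cost rule matching packet p, or NULL if none matches */
    static const struct pcp_rule *pcp_match(const struct pcp_rule *rules,
                                            int n, const uint32_t p[D])
    {
            const struct pcp_rule *best = NULL;
            int i, d;

            for (i = 0; i < n; i++) {
                    for (d = 0; d < D; d++)
                            if (p[d] < rules[i].a[d] || p[d] > rules[i].b[d])
                                    break;
                    if (d == D && (best == NULL || rules[i].cost < best->cost))
                            best = &rules[i];
            }
            return best;
    }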

>>BTW, this approach can be extended to 2 or more dimensions
>>where the hash function for each hash has to meet the
>>requirement. Of course this information is not very helpful
>>since the problem of defining appropriate hash functions
>>remains ;)
> 
> Is that problem even solvable?

Well, one could argue that the _problem_ is not defined precisely
enough to answer this question since "easily (= fast)
computable hash function" is rather vague.

Ignoring the "fast computable" requirement, the problem
is solvable simply because the set of candidate hash functions
is finite ... but again this does not help a bit ;-)

So, do you agree that it is _impossible_ to express arbitrary
access lists using the hash tree approach if each hash bucket
is required to hold O(1) rules?
[You already agreed that this requirement is needed for the
  tree of hashes to perform optimally.]

> This is why i was wondering how fast your insertions and deletions are.
> Seems to me you will have to sort the rules every time.

No, there is no sorting of rules involved. Each rule insert
operation inserts the native matches of the rule into the
(dim)tree.

> Dont you think this "dynamic adjustment" is unnecessary? Essentially you
> enforce a model where every rule has a different priority.

Well, we consider the model beneficial for the user.
First of all the requirement is also present in the
classical PCP definition ;) and secondly it supports
dynamic rulesets better because using static priorities
you ran into the problem of redefining a lot of prios
manually sooner or later.

> The general idea is to recreate the tree if need be based on collisions.
> I just hope some idiot reading this doesnt go and patent it (has happened
> before).

Well, this would be bad of course although it shouldn't be
possible to reverse engineer the algorithm from your
description ;)

> I was wondering why not just optimize it for Linux. You are trying to
> center the world around nf-hipac - I would just center it around 
> Linux ;->

LOL, this is _really_ funny :-))
In fact, we just try to be as general as possible without
degrading performance. So far, everything integrates fine
into Linux. Although the nf-hipac patch is rather large
(530 KB), only 205 lines of code have to be added to existing
kernel files.

>>Well, in order to support iptables matches and targets we had
>>to create an appropriate abstraction for them on the hipac
>>layer. This abstraction can also be used for tc classifiers
>>if the tcf_result is ignored, i.e. you just consider whether
>>the filter matched or not.
> 
> I am not sure i understood the part about ignoring tcf_result.

If there were no tcf_result then classifiers would
simply be matches, e.g. like iptables matches. They don't have
an action associated with them. In iptables, from a technical
viewpoint you could say that a target is basically the same
as a match with an action attached to it.

> You show only one classid per rule .. I think i can see what you meant
> by ignoring tcf_result - essentially you want to have a series of filter
> rules with different classification engines, no?

Hm, rather one rule with several classifiers and native matches.

> But could you not have
> the filters repeat the same classid for every filter? 

Each rule can only have one action or return "value"
attached to it so if we want to use embedded classifiers
(as matches) their classid (= action) must be ignored.
The reason why we want to have several non-native
matches in one rule is to allow the expression of a
conjunction of these matches. This is not possible using
several rules.

> Also it seems you want to be able to have an action defined for
> "nomatch" as well as "match" - is that correct? Some form of
> reclassification when nomatch ?

No, we don't want special actions defined for "match" or
"nomatch". The action for a rule is already given by
its classid.


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+

[-- Attachment #2: Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
@ 2003-08-12  2:57                 ` jamal
  2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
  2003-08-12  5:40                 ` David S. Miller
  1 sibling, 1 reply; 24+ messages in thread
From: jamal @ 2003-08-12  2:57 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: linux-net, netdev

Hi,

On Mon, 2003-08-11 at 18:39, Michael Bellion and Thomas Heinz wrote:

> > It was very close ;-> The guy looked motivated; i felt scared for a while
> > that he would be asking a lot of questions. Then i never heard about it
> > again ;-> I think he left town too. 
> 
> Too bad, you should have told the guy about all the fame
> and glory ;-)
> 

Some fires scare even brave firemen. This was a brave fireman until he
saw the fire ;->

[..]

> 
> This is a fundamental issue related to the use of hashes in this
> context. It shows that it is _impossible_ to create a hash which
> meets the requirement of O(1) rules per bucket in the setting
> described above, regardless of how cleverly you choose the hash function.
> 
> What do you think about it? Do you agree?
> 

I have to go back and read the theory again to fully comprehend it but
intuitively you are right. I claim that if you knew your input well then
you could construct a hash which would closely approach perfect hashing.
So if i were to use the example you gave me earlier i could approximate an
O(1) hash. u32 allows me to do so once i know what things will look like.
Of course i have to do that manually with a pencil and paper.



> 
> >>BTW, this approach can be extended to 2 or more dimensions
> >>where the hash function for each hash has to meet the
> >>requirement. Of course this information is not very helpful
> >>since the problem of defining appropriate hash functions
> >>remains ;)
> > 
> > Is that problem even solvable?
> 
> Well, one could argue that the _problem_ is not defined precisely
> enough to answer this question since "easily (= fast)
> computable hash function" is rather vague.
> 
> Ignoring the "fast computable" requirement, the problem
> is solvable simply because the set of candidate hash functions
> is finite ... but again this does not help a bit ;-)
> 
> So, do you agree that it is _impossible_ to express arbitrary
> access lists using the hash tree approach if each hash bucket
> is required to hold O(1) rules?
> [You already agreed that this requirement is needed for the
>   tree of hashes to perform optimally.]
> 

yes. But note the caveat i put above on knowing the input and being able
to manually use a pencil to map out the hash. Now automate the pencil
and paper and let some cpu analyse the traffic patterns as well as
adjust the hashes ... maybe a thesis for someone.

> > This is why i was wondering how fast your insertions and deletions are.
> > Seems to me you will have to sort the rules every time.
> 
> No, there is no sorting of rules involved. Each rule insert
> operation inserts the native matches of the rule into the
> (dim)tree.
> 

I will have to catch up with you at some point - i think i understand what
you mean.

> > Dont you think this "dynamic adjustment" is unnecessary? Essentially you
> > enforce a model where every rule has a different priority.
> 
> Well, we consider the model beneficial for the user.
> First of all the requirement is also present in the
> classical PCP definition ;) and secondly it supports
> dynamic rulesets better because using static priorities
> you run into the problem of redefining a lot of prios
> manually sooner or later.
> 

well, with the iptables scheme at least you need to know where to insert
the rules; so theres some manual labor involved there as well ;->
Out of curiosity with such a model how do you define a default rule?
In the prio model (and tc), you can have a default catchall rule at the
lowest priority; that way, should higher prios fail, it will catch all.


[..]

> > 
> > I am not sure i understood the part about ignoring tcf_result.
> 
> If there were no tcf_result then classifiers would
> simply be matches, e.g. like iptables matches. They don't have
> an action associated with them. In iptables, from a technical
> viewpoint you could say that a target is basically the same
> as a match with an action attached to it.
> 

Ok, i see your point, so you are saying that a tcf_result is in fact a
hardcoded action which should probably be called "setclassid".
Interesting thought.

[..]

> > But could you not have
> > the filters repeat the same classid for every filter? 
> 
> Each rule can only have one action or return "value"
> attached to it so if we want to use embedded classifiers
> (as matches) their classid (= action) must be ignored.

you want the "continue" action essentially.

> The reason why we want to have several non-native
> matches in one rule is to allow the expression of a
> conjunction of these matches. This is not possible using
> several rules.

I think i finally see your point. Yes, this could be an improvement that
could be made to tc. In fact you have motivated me to write a
"setclassid" action which will override the classid defined otherwise.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
  2003-08-12  2:57                 ` jamal
@ 2003-08-12  5:40                 ` David S. Miller
  2003-08-12 14:29                   ` Jamie Lokier
  2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
  1 sibling, 2 replies; 24+ messages in thread
From: David S. Miller @ 2003-08-12  5:40 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: hadi, linux-net, netdev

On Tue, 12 Aug 2003 00:39:58 +0200
Michael Bellion and Thomas Heinz <nf@hipac.org> wrote:

> This is a fundamental issue related to the use of hashes in this
> context. It shows that it is _impossible_ to create a hash which
> meets the requirement of O(1) rules per bucket in the setting
> described above, regardless of how cleverly you choose the hash function.
> 
> What do you think about it? Do you agree?

The ipv4 FIB hash tables tackle exactly this problem.  The resulting
worst case complexity is O(n_bits_in_prefix_mask).

The problem you've described is exactly the same as trying to use
hashing for routing tables.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12  5:40                 ` David S. Miller
@ 2003-08-12 14:29                   ` Jamie Lokier
  2003-08-12 15:39                     ` Michael Bellion and Thomas Heinz
  2003-08-13 17:28                     ` Ralph Doncaster
  2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
  1 sibling, 2 replies; 24+ messages in thread
From: Jamie Lokier @ 2003-08-12 14:29 UTC (permalink / raw)
  To: David S. Miller; +Cc: Michael Bellion and Thomas Heinz, hadi, linux-net, netdev

David S. Miller wrote:
> On Tue, 12 Aug 2003 00:39:58 +0200
> Michael Bellion and Thomas Heinz <nf@hipac.org> wrote:
> 
> > This is a fundamental issue related to the use of hashes in this
> > context. It shows that it is _impossible_ to create a hash which
> > meets the requirement of O(1) rules per bucket in the setting
> > described above, regardless of how cleverly you choose the hash function.
> > 
> > What do you think about it? Do you agree?
> 
> The ipv4 FIB hash tables tackle exactly this problem.  The resulting
> worst case complexity is O(n_bits_in_prefix_mask).
> 
> The problem you've described is exactly the same as trying to use
> hashing for routing tables.

You can do it in O(log(n_bits_in_prefix_mask)).

This is achieved using binary search on the prefix lengths which
appear in the table.  (So if only a few prefix lengths are used, the
tree is small).

Illustrated by an example.  The example assumes 32 bits total and
every prefix length appears in the table.  Each node in the search
tree has one or more of these fields:

   - TABLE: Sparse (hash) or dense (direct) mapping of some
     subset of input bits to subtree nodes.
   - SHORTER: Subtree node, for matching shorter prefixes than the
     one which led to the current node.
   - BEST: Best matching result given the prefix that led to this node.
     This is either an exact match for that prefix, or the best
     shorter-prefix match for it.
     (On the root node, this will be the default value).

In the worst case of N bits in the prefix, and every prefix length is
actually used, there would be N tree nodes; however, the depth of the
search tree is ceil(log2(N)).

This is the lookup algorithm:

   - Start with the root tree node.
     Note that this isn't the shortest prefix matcher.
     It will match the "middle" prefix length, given your set of lengths.
     E.g. on a 32-bit address, with rules of all prefix lengths, the root
     tree node would match 16 bits.
LOOP:
   - If NODE->TABLE exists, lookup the appropriate subset of input bits in it.
     For sparse tables, this involves a hash lookup.
     If a match is found, select that as the new tree node and goto LOOP.
   - Otherwise there is no table or no match in the table.
   - If NODE->SHORTER exists, select that as the new tree node and goto LOOP.
   - Otherwise, return NODE->BEST.
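
For concreteness, the nodes and the loop sketched in C (table_find()
is an assumed helper that looks up the node's subset of key bits and
returns NULL when no entry matches):

    #include <stddef.h>
    #include <stdint.h>

    struct result;   /* whatever a match resolves to */
    struct table;    /* sparse (hash) or dense (direct) mapping */

    const struct node *table_find(const struct table *t, uint32_t key);

    struct node {
            const struct table *table;    /* TABLE */
            const struct node  *shorter;  /* SHORTER */
            struct result      *best;     /* BEST */
    };

    struct result *lookup(const struct node *n, uint32_t key)
    {
            for (;;) {
                    const struct node *child =
                            n->table ? table_find(n->table, key) : NULL;

                    if (child)
                            n = child;        /* matched a longer prefix */
                    else if (n->shorter)
                            n = n->shorter;   /* fall back to shorter */
                    else
                            return n->best;   /* best available answer */
            }
    }

The loop terminates because each step either descends into a
longer-prefix table or falls back to a strictly shorter prefix.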

This generalises to multiple dimensions e.g. for doing multiple
prefixes on source+target + different combinations of other bits such
as protocol, TOS etc. - i.e. arbitrary bit-subset classifiers.  The
basic principle and the algorithm are the same.

In some cases it is worth reducing the tree size a little, e.g. if
prefix lengths P and P+1 appear, you may as well just hash on P+1 bits
and insert all the P prefix matches into the P+1 table, instead of
having a P table.  The cost is up to twice the number of entries in
the P+1 table (depending on how dense your set of P+1 rules is), but
this may be better than the cost of an extra hash lookup.  Naturally,
there are other optimisations too.

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12  2:57                 ` jamal
@ 2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
  2003-08-12 15:41                     ` jamal
  0 siblings, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-12 14:56 UTC (permalink / raw)
  To: hadi; +Cc: linux-net, netdev

[-- Attachment #1: Type: text/plain, Size: 3042 bytes --]

Hi Jamal

You wrote:
> Some fires scare even brave firemen. This was a brave fireman until he
> saw the fire ;->

LOL :-)

> I have to go back and read the theory again to fully comprehend it but
> intuitively you are right. I claim that if you knew your input well then
> you could construct a hash which would closely approach perfect hashing.
> So if i were to use the example you gave me earlier i could approximate an
> O(1) hash. u32 allows me to do so once i know what things will look like.
> Of course i have to do that manually with a pencil and paper.

Apart from the fact that the "pencil and paper" approach is not
feasible for complex examples where there are no strict regularities
in the rule set, you simply cannot avoid the implications when
choosing a certain prefix length for the hash (as described in the
previous posting). So probably in most cases even "pencil and paper"
won't help.

> yes. But note the caveat i put above on knowing the input and being able
> to manually use a pencil to map out the hash. Now automate the pencil
> and paper and let some cpu analyse the traffic patterns as well as
> adjust the hashes ... maybe a thesis for someone.

Yes, maybe someone should dig further but as long as the
assumptions stay the same you have to live with the consequences
stated in our last posting.

> well, with the iptables scheme at least you need to know where to insert
> the rules; so theres some manual labor involved there as well ;->
> Out of curiosity with such a model how do you define a default rule?
> In the prio model (and tc), you can have a default catchall rule at the
> lowest priority; that way, should higher prios fail, it will catch all.

In theory the default rule has priority "infinity".
In an implementation you can simply set the priority of the default
rule to 1 + max{prio(r) | for all rules r}. So it's increased by 1 for
every insert operation.

> Ok, i see your point, so you are saying that a tcf_result is in fact a
> hardcoded action which should probably be called "setclassid".
> Interesting thought.

Yes :-)

>>Each rule can only have one action or return "value"
>>attached to it so if we want to use embedded classifiers
>>(as matches) their classid (= action) must be ignored.
> 
> you want the "continue" action essentially.

Sorry but we don't seem to get your point here.
How is the "continue" action related to having multiple
embedded classifiers in a rule which essentially provides
1 single action?

> I think i finally see your point. Yes, this could be an improvement that
> could be made to tc. In fact you have motivated me to write a
> "setclassid" action which will override the classid defined otherwise.

:-)


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+

[-- Attachment #2: Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12  5:40                 ` David S. Miller
  2003-08-12 14:29                   ` Jamie Lokier
@ 2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
  1 sibling, 0 replies; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-12 14:56 UTC (permalink / raw)
  To: David S. Miller; +Cc: hadi, linux-net, netdev

[-- Attachment #1: Type: text/plain, Size: 1446 bytes --]

Hi David

You wrote:
> The ipv4 FIB hash tables tackle exactly this problem.  The resulting
> worst case complexity is O(n_bits_in_prefix_mask).
> 
> The problem you've described is exactly the same as trying to use
> hashing for routing tables.

Yes, that's true for the 1-dimensional case. You refer to the
following algorithm, don't you?

    min_prio := infinity
    match_rule := nil
    for all list_entries e: {             // at most w+1 entries where w
                                          // is the bit width of the keys
        rule := lookup(hash(e), packet)   // let's assume O(1) here
        if (rule != nil && prio(rule) < min_prio) {
            min_prio := prio(rule)
            match_rule := rule
        }
    }
    return match_rule

This algorithm has running time O(w).
[In fact it's O(number of different prefix lengths) which is O(w)
  in the worst case.]

But what about the d-dimensional case? If you extend this
approach to handle rules with d prefix-based matches you end
up with a running time of O(w^d) (assuming w to be the max.
bit width), since in the worst case every combination of
prefix lengths across the d dimensions has to be probed.

This is definitely not desirable.


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+

[-- Attachment #2: Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12 14:29                   ` Jamie Lokier
@ 2003-08-12 15:39                     ` Michael Bellion and Thomas Heinz
  2003-08-15 14:28                       ` Jamie Lokier
  2003-08-13 17:28                     ` Ralph Doncaster
  1 sibling, 1 reply; 24+ messages in thread
From: Michael Bellion and Thomas Heinz @ 2003-08-12 15:39 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: David S. Miller, hadi, linux-net, netdev

[-- Attachment #1: Type: text/plain, Size: 1599 bytes --]

Hi Jamie

You wrote:
> You can do it in O(log(n_bits_in_prefix_mask)).
> 
> This is achieved using binary search on the prefix lengths which
> appear in the table.  (So if only a few prefix lengths are used, the
> tree is small).
> 
> [example snipped]

That's true for the one dimensional PCP (with prefix rules) if
the following condition holds:
for all rules r1, r2: (plen(r1) > plen(r2)) &&
                       (value(r1) & mask(r2) == value(r2) & mask(r2))
                    => prio(r1) < prio(r2)
                       [smallest prio wins; plen(r) is the prefix length
                        of r, mask(r) its prefix mask, value(r) its value]

It's quite reasonable to assume that this condition holds
for the one-dimensional case since otherwise there would be
rules that can never match.

> This generalises to multiple dimensions e.g. for doing multiple
> prefixes on source+target + different combinations of other bits such
> as protocol, TOS etc. - i.e. arbitrary bit-subset classifiers.  The
> basic principle and the algorithm are the same.

Hm, how do you want to solve the d-dimensional PCP by
doing binary search for each dimension? Remember that
PCP is not related to longest prefix matching. Instead
priorities are used.

Maybe you should describe in a little more detail what you mean
by "This generalises to multiple dimensions ...".


Regards,

+-----------------------+----------------------+
|   Michael Bellion     |     Thomas Heinz     |
| <mbellion@hipac.org>  |  <creatix@hipac.org> |
+-----------------------+----------------------+
|    High Performance Packet Classification    |
|       nf-hipac: http://www.hipac.org/        |
+----------------------------------------------+

[-- Attachment #2: Type: application/pgp-signature, Size: 251 bytes --]

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
@ 2003-08-12 15:41                     ` jamal
  0 siblings, 0 replies; 24+ messages in thread
From: jamal @ 2003-08-12 15:41 UTC (permalink / raw)
  To: nf; +Cc: linux-net, netdev

On Tue, 2003-08-12 at 10:56, Michael Bellion and Thomas Heinz wrote:
> Hi Jamal
> 
> Apart from the fact that the "pencil and paper" approach is not
> feasible for complex examples where there are no strict regularities
> in the rule set, you simply cannot avoid the implications when
> choosing a certain prefix length for the hash (as described in the
> previous posting). So probably in most cases even "pencil and paper"
> won't help.
> 

If you give me _sufficient time_ it should work ;->
Note, i am not defining sufficient time ;-> Its a different
approach; your approach - when you get it right - is the right way.

> > yes. But note the caveat i put above on knowing the input and being able
> > to manually use a pencil to map out the hash. Now automate the pencil
> > and paper and let some cpu analyse the traffic patterns as well as
> > adjust the hashes ... maybe a thesis for someone.
> 
> Yes, maybe someone should dig further but as long as the
> assumptions stay the same you have to live with the consequences
> stated in our last posting.
> 

If you have any students interested in a thesis let me know.

> > well, with the iptables scheme at least you need to know where to insert
> > the rules; so theres some manual labor involved there as well ;->
> > Out of curiosity with such a model how do you define a default rule?
> > In the prio model (and tc), you can have a default catchall rule at the
> > lowest priority; that way, should higher prios fail, it will catch all.
> 
> In theory the default rule has priority "infinity".
> In an implementation you can simply set the priority of the default
> rule to 1 + max{prio(r) | for all rules r}. So it's increased by 1 for
> every insert operation.
> 

And the reason you do this is because you dont know how many rules will
exist ahead of time. With static priorities you can set the default to
the max value.


> >>Each rule can only have one action or return "value"
> >>attached to it so if we want to use embedded classifiers
> >>(as matches) their classid (= action) must be ignored.
> > 
> > you want the "continue" action essentially.
> 
> Sorry but we don't seem to get your point here.
> How is the "continue" action related to having multiple
> embedded classifiers in a rule which essentially provides
> 1 single action?
> 

tc add <filter1 desc> continue <filter2 desc> continue <filter3 desc>
accept

- filter1-3 are derived from different classifiers. the accept at the
end is doable if you use the tc extensions i am working on. 

on the setclassid stuff i will ping Alexey and Davem - it could be
a 2.7 action item. For now, you will have to live with it. BTW, have you
tried to leave out the classid definition? ;-> policy gets
accepted fine, i think its a bug but it could be a feature - dont have
time to look at it right now.

cheers,
jamal

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12 14:29                   ` Jamie Lokier
  2003-08-12 15:39                     ` Michael Bellion and Thomas Heinz
@ 2003-08-13 17:28                     ` Ralph Doncaster
  2003-08-13 19:17                       ` Jamie Lokier
  1 sibling, 1 reply; 24+ messages in thread
From: Ralph Doncaster @ 2003-08-13 17:28 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: netdev


On Tue, 12 Aug 2003, Jamie Lokier wrote:
> David S. Miller wrote:
> > On Tue, 12 Aug 2003 00:39:58 +0200
> > The ipv4 FIB hash tables tackle exactly this problem.  The resulting
> > worst case complexity is O(n_bits_in_prefix_mask).
> >
> > The problem you've described is exactly the same as trying to use
> > hashing for routing tables.
>
> You can do it in O(log(n_bits_in_prefix_mask)).
>
> This is achieved using binary search on the prefix lengths which
> appear in the table.  (So if only a few prefix lengths are used, the
> tree is small).
>
> Illustrated by an example.  The example assumes 32 bits total and
> every prefix length appears in the table.  Each node in the search
> tree has one or more of these fields:
>
>    - TABLE: Sparse (hash) or dense (direct) mapping of some
>      subset of input bits to subtree nodes.
>    - SHORTER: Subtree node, for matching shorter prefixes than the
>      one which led to the current node.
>    - BEST: Best matching result given the prefix that led to this node.
>      This is either an exact match for that prefix, or the best
>      shorter-prefix match for it.
>      (On the root node, this will be the default value).
>
> In the worst case of N bits in the prefix, and every prefix length is
> actually used, there would be N tree nodes; however, the depth of the
> search tree is ceil(log2(N)).
>
> This is the lookup algorithm:
>
>    - Start with the root tree node.
>      Note that this isn't the shortest prefix matcher.
>      It will match the "middle" prefix length, given your set of lengths.
>      E.g. on a 32-bit address, with rules of all prefix lengths, the root
>      tree node would match 16 bits.
> LOOP:
>    - If NODE->TABLE exists, lookup the appropriate subset of input bits in it.
>      For sparse tables, this involves a hash lookup.
>      If a match is found, select that as the new tree node and goto LOOP.
>    - Otherwise there is no table or no match in the table.
>    - If NODE->SHORTER exists, select that as the new tree node and goto LOOP.
>    - Otherwise, return NODE->BEST.
>
> This generalises to multiple dimensions e.g. for doing multiple
> prefixes on source+target + different combinations of other bits such
> as protocol, TOS etc. - i.e. arbitrary bit-subset classifiers.  The
> basic principle and the algorithm are the same.
>
> In some cases it is worth reducing the tree size a little, e.g. if
> prefix lengths P and P+1 appear, you may as well just hash on P+1 bits
> and insert all the P prefix matches into the P+1 table, instead of
> having a P table.  The cost is up to twice the number of entries in
> the P+1 table (depending on how dense your set of P+1 rules is), but
> this may be better than the cost of an extra hash lookup.  Naturally,
> there are other optimisations too.

Most of this doesn't make sense to me.  The part about merging
prefix-length tables is OK (i.e. 10.10.0.0/16 and 10.10.128.0/17 become
10.10.0.0/17 and 10.10.128.0/17).

However if you have 32 tables (one for each prefix length) I can't see any
possible way of avoiding a search in each table for the worst case.  So if
my packet is destined for 217.109.118.70 you need to start looking for a
match for a /32 route, and if not found continue checking each prefix size
until you've reached /0.

-Ralph

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-13 17:28                     ` Ralph Doncaster
@ 2003-08-13 19:17                       ` Jamie Lokier
  2003-08-13 21:10                         ` Ralph Doncaster
  0 siblings, 1 reply; 24+ messages in thread
From: Jamie Lokier @ 2003-08-13 19:17 UTC (permalink / raw)
  To: ralph+d; +Cc: netdev

Ralph Doncaster wrote:
> However if you have 32 tables (one for each prefix length) I can't see any
> possible way of avoiding a search in each table for the worst case.  So if
> my packet is destined for 217.109.118.70 you need to start looking for a
> match for a /32 route, and if not found continue checking each prefix size
> until you've reached /0.

You would start by searching for a 217.109.0.0/16 entry.  That's
the root in the search tree.

That would match, and the matching tree node would tell you to search
a specific table for 217.109.118.0/24.  (Actually, just
0.0.118.0/0.0.255.0, because this node can assume the first 16 bits).

That would match, and the matching tree node tells you to search a
specific table for 217.109.118.64/28 (0.0.0.64/0.0.0.240).

That would match, and the matching tree node tells you to search a
specific table for 217.109.118.70/32 (0.0.0.6/0.0.0.15).

That would match, and is your result.

Without the optimisation you said you understand, there would be a few
more steps, narrowing to /30, /31 then /32, but for such small hashes,
not only is it faster, it also uses less memory to just use a single
4-bit table lookup than to have a tree of tiny hash tables.

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-13 19:17                       ` Jamie Lokier
@ 2003-08-13 21:10                         ` Ralph Doncaster
  2003-08-13 23:21                           ` Jamie Lokier
  0 siblings, 1 reply; 24+ messages in thread
From: Ralph Doncaster @ 2003-08-13 21:10 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: netdev

On Wed, 13 Aug 2003, Jamie Lokier wrote:

> You would start by searching for a 217.109.0.0/16 entry.  That's
> the root in the search tree.
>
> That would match, and the matching tree node would tell you to search
> a specific table for 217.109.118.0/24.  (Actually, just
> 0.0.118.0/0.0.255.0, because this node can assume the first 16 bits).

So you have to put an entry in the /16 table for every /16 that you have a
more specific route for, right?
Then what if I have 3 different routes; one for 217.109.0.0/16, another
for 217.109.118.0/24 and one for 217.109.118.68/32?

-Ralph

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-13 21:10                         ` Ralph Doncaster
@ 2003-08-13 23:21                           ` Jamie Lokier
  0 siblings, 0 replies; 24+ messages in thread
From: Jamie Lokier @ 2003-08-13 23:21 UTC (permalink / raw)
  To: ralph+d; +Cc: netdev

Ralph Doncaster wrote:
> So you have to put an entry in the /16 table for every /16 that you have a
> more specific route for, right?
> Then what if I have 3 different routes; one for 217.109.0.0/16, another
> for 217.109.118.0/24 and one for 217.109.118.68/32?

Then you would have one entry in the /16 table, matching
217.109.0.0/16, whose BEST value is the first route and whose TABLE
points to another table.

The second table would have one entry matching 217.109.118.0/24, whose
BEST value is the second route, and whose TABLE points to another
table.

The third table would have one entry matching 217.109.118.68/32, whose
BEST value is the third route, and which has no TABLE.

That's three hash tables, each containing just one entry.

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC] High Performance Packet Classifiction for tc framework
  2003-08-12 15:39                     ` Michael Bellion and Thomas Heinz
@ 2003-08-15 14:28                       ` Jamie Lokier
  0 siblings, 0 replies; 24+ messages in thread
From: Jamie Lokier @ 2003-08-15 14:28 UTC (permalink / raw)
  To: Michael Bellion and Thomas Heinz; +Cc: David S. Miller, hadi, linux-net, netdev

Michael Bellion and Thomas Heinz wrote:
> >This generalises to multiple dimensions e.g. for doing multiple
> >prefixes on source+target + different combinations of other bits such
> >as protocol, TOS etc. - i.e. arbitrary bit-subset classifiers.  The
> >basic principle and the algorithm are the same.
> 
> Hm, how do you want to solve the d-dimensional PCP by
> doing binary search for each dimension? Remember that
> PCP is not related to longest prefix matching. Instead
> priorities are used.

I don't know what you mean by "PCP", so can't answer the question.

> Maybe you should describe in a little more detail what you mean
> by "This generalises to multiple dimensions ...".

I mean that the lookup algorithm works for multi-dimensional searches.
Creating the search tree can be a little more involved, and I am not
sure how much (if any) node duplication is needed when some kinds of
rule priorities are used.

It may help to say that you don't have to do binary search for each
dimension separately, although that is a possible strategy for prefix
matching.

You can build a tree where each node represents a point (a,b,c...) in
prefix-length space, and whose children are other points in that
space, if you choose to see the general problem as matching multiple
prefixes.

At one extreme, a general bit matcher (i.e. no non-power-of-2
numerical ranges) can be treated as a multi-dimensional prefix match
where each prefix is length 0 or 1.  You see that, if the input rule
set is _equivalent_ to a longest-prefix single-dimensional rule set,
the optimal multi-dimensional search tree is trivally found and it
does not do binary search on each dimension separately.

-- Jamie

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread

Thread overview: 24+ messages
2003-07-14  8:45 [RFC] High Performance Packet Classifiction for tc framework Michael Bellion and Thomas Heinz
2003-07-16  5:02 ` jamal
2003-07-17 13:13   ` Michael Bellion and Thomas Heinz
2003-08-03 18:14     ` jamal
2003-08-04 13:17       ` Michael Bellion and Thomas Heinz
2003-08-04 15:51         ` jamal
2003-08-05 22:21           ` Michael Bellion and Thomas Heinz
2003-08-07 19:58             ` jamal
2003-08-07 20:05               ` David S. Miller
2003-08-08 21:51                 ` jamal
2003-08-09  0:01                   ` David S. Miller
2003-08-11 22:39               ` Michael Bellion and Thomas Heinz
2003-08-12  2:57                 ` jamal
2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
2003-08-12 15:41                     ` jamal
2003-08-12  5:40                 ` David S. Miller
2003-08-12 14:29                   ` Jamie Lokier
2003-08-12 15:39                     ` Michael Bellion and Thomas Heinz
2003-08-15 14:28                       ` Jamie Lokier
2003-08-13 17:28                     ` Ralph Doncaster
2003-08-13 19:17                       ` Jamie Lokier
2003-08-13 21:10                         ` Ralph Doncaster
2003-08-13 23:21                           ` Jamie Lokier
2003-08-12 14:56                   ` Michael Bellion and Thomas Heinz
