All of lore.kernel.org
 help / color / mirror / Atom feed
* tc: u32: Wrong sample hash calculation
@ 2021-01-18 11:29 Phil Sutter
  2021-01-20 13:55 ` Jamal Hadi Salim
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2021-01-18 11:29 UTC (permalink / raw)
  To: Jamal Hadi Salim, Stephen Hemminger
  Cc: netdev, Cong Wang, Jiri Pirko, Russell Stuart

Hi!

Playing with u32 filter's hash table I noticed it is not possible to use
'sample' option with keys larger than 8bits to calculate the hash
bucket. Turns out key hashing in kernel and iproute2 differ:

* net/sched/cls_u32.c (kernel) basically does:

hash = ntohl(key & mask);
hash >>= ffs(ntohl(mask)) - 1;
hash &= 0xff;
hash %= divisor;

* while tc/f_u32.c (iproute2) does:

hash = key & mask;
hash ^= hash >> 16;
hash ^= hash >> 8;
hash %= divisor;

In iproute2, the code changed in 2006 with commit 267480f55383c
("Backout the 2.4 utsname hash patch."), here's the relevant diff:

  hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
- uname(&utsname);
- if (strncmp(utsname.release, "2.4.", 4) == 0) {
-         hash ^= hash>>16;
-         hash ^= hash>>8;
- }
- else {
-         __u32 mask = sel2.sel.keys[0].mask;
-         while (mask && !(mask & 1)) {
-                 mask >>= 1;
-                 hash >>= 1;
-         }
-         hash &= 0xFF;
- }
+ hash ^= hash>>16;
+ hash ^= hash>>8;
  htid = ((hash%divisor)<<12)|(htid&0xFFF00000);

The old code would work if key and mask weren't in network byteorder. I
guess that also changed since then.

I would simply send a patch to fix iproute2, but I don't like the
kernel's hash "folding" as it ignores any bits beyond the first eight.
So I would prefer to "fix" the kernel instead but would like to hear
your opinions as that change has a much larger scope than just
iproute2's 'sample' option.

Thanks, Phil

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-18 11:29 tc: u32: Wrong sample hash calculation Phil Sutter
@ 2021-01-20 13:55 ` Jamal Hadi Salim
  2021-01-20 15:23   ` Phil Sutter
  0 siblings, 1 reply; 7+ messages in thread
From: Jamal Hadi Salim @ 2021-01-20 13:55 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, netdev, Cong Wang, Jiri Pirko,
	Russell Stuart

Hi,

On 2021-01-18 6:29 a.m., Phil Sutter wrote:
> Hi!
> 
> Playing with u32 filter's hash table I noticed it is not possible to use
> 'sample' option with keys larger than 8bits to calculate the hash
> bucket. 


I have mostly used something like: ht 2:: sample ip protocol 1 0xff
Hoping this is continuing to work.

I feel i am missing something basic in the rest of your email:
Sample is a user space concept i.e it is used to instruct the
kernel what table/bucket to insert the node into. This computation
is done in user space. The kernel should just walk the nodes in
the bucket and match.
Reminder: you can only have 256 buckets (8 bit representation).
Could that be the contributing factor?
Here's an example of something which is not 8 bit that i found in
an old script that should work (but I didnt test in current kernels).
ht 2:: sample u32 0x00000800 0x0000ff00 at 12
We are still going to extract only 8 bits for the bucket.

Can you provide an example of what wouldnt work?

cheers,
jamal

> Turns out key hashing in kernel and iproute2 differ:
> 
> * net/sched/cls_u32.c (kernel) basically does:
> 
> hash = ntohl(key & mask);
> hash >>= ffs(ntohl(mask)) - 1;
> hash &= 0xff;
> hash %= divisor;
> 
> * while tc/f_u32.c (iproute2) does:
> 
> hash = key & mask;
> hash ^= hash >> 16;
> hash ^= hash >> 8;
> hash %= divisor;
> 
> In iproute2, the code changed in 2006 with commit 267480f55383c
> ("Backout the 2.4 utsname hash patch."), here's the relevant diff:
> 
>    hash = sel2.sel.keys[0].val&sel2.sel.keys[0].mask;
> - uname(&utsname);
> - if (strncmp(utsname.release, "2.4.", 4) == 0) {
> -         hash ^= hash>>16;
> -         hash ^= hash>>8;
> - }
> - else {
> -         __u32 mask = sel2.sel.keys[0].mask;
> -         while (mask && !(mask & 1)) {
> -                 mask >>= 1;
> -                 hash >>= 1;
> -         }
> -         hash &= 0xFF;
> - }
> + hash ^= hash>>16;
> + hash ^= hash>>8;
>    htid = ((hash%divisor)<<12)|(htid&0xFFF00000);
> 
> The old code would work if key and mask weren't in network byteorder. I
> guess that also changed since then.
> 
> I would simply send a patch to fix iproute2, but I don't like the
> kernel's hash "folding" as it ignores any bits beyond the first eight.
> So I would prefer to "fix" the kernel instead but would like to hear
> your opinions as that change has a much larger scope than just
> iproute2's 'sample' option.
> 
> Thanks, Phil
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-20 13:55 ` Jamal Hadi Salim
@ 2021-01-20 15:23   ` Phil Sutter
  2021-01-22 11:25     ` Jamal Hadi Salim
  0 siblings, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2021-01-20 15:23 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Stephen Hemminger, netdev, Cong Wang, Jiri Pirko, Russell Stuart

Hi Jamal,

On Wed, Jan 20, 2021 at 08:55:11AM -0500, Jamal Hadi Salim wrote:
> On 2021-01-18 6:29 a.m., Phil Sutter wrote:
> > Hi!
> > 
> > Playing with u32 filter's hash table I noticed it is not possible to use
> > 'sample' option with keys larger than 8bits to calculate the hash
> > bucket. 
> 
> 
> I have mostly used something like: ht 2:: sample ip protocol 1 0xff
> Hoping this is continuing to work.

This should read 'sample ip protocol 1 divisor 0xff', right?

> I feel i am missing something basic in the rest of your email:
> Sample is a user space concept i.e it is used to instruct the
> kernel what table/bucket to insert the node into. This computation
> is done in user space. The kernel should just walk the nodes in
> the bucket and match.

Correct, but the kernel has to find the right bucket first. This is
where its key hashing comes into place.

> Reminder: you can only have 256 buckets (8 bit representation).
> Could that be the contributing factor?

It is. Any key smaller than 256B is unaffected as no folding is done in
either kernel or user space.

> Here's an example of something which is not 8 bit that i found in
> an old script that should work (but I didnt test in current kernels).
> ht 2:: sample u32 0x00000800 0x0000ff00 at 12
> We are still going to extract only 8 bits for the bucket.

Yes. The resulting key is 8Bit as the low zeroes are automatically
shifted away.

> Can you provide an example of what wouldnt work?

Sure, sorry for not including it in the original email. Let's apply
actions to some packets based on source IP address. To efficiently
support arbitrary numbers, we use a hash table with 256 buckets:

# tc qd add dev test0 ingress
# tc filter add dev test0 parent ffff: prio 99 handle 1: u32 divisor 256
# tc filter add dev test0 parent ffff: prio 1 protocol ip u32 \
	hashkey mask 0xffffffff at 12 link 1: match u8 0 0

So with the above in place, the kernel uses 32bits at offset 12 as a key
to determine the bucket to jump to. This is done by just extracting the
lowest 8bits in host byteorder, i.e. the last octet of the packet's
source address.

Users don't know the above (and shouldn't need to), so they use sample
to have the bucket determined automatically:

# tc filter add dev test0 parent ffff: prio 99 u32 \
	match ip src 10.0.0.2 \
	ht 1: sample ip src 10.0.0.2 divisor 256 \
	action drop

iproute2 calculates bucket 8 (= 10 ^ 2), while the kernel will check
bucket 2. So the above filter will never match.

Cheers, Phil

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-20 15:23   ` Phil Sutter
@ 2021-01-22 11:25     ` Jamal Hadi Salim
  2021-01-22 12:24       ` Phil Sutter
  2021-01-22 13:59       ` Phil Sutter
  0 siblings, 2 replies; 7+ messages in thread
From: Jamal Hadi Salim @ 2021-01-22 11:25 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, netdev, Cong Wang, Jiri Pirko,
	Russell Stuart

Hi Phil,

On 2021-01-20 10:23 a.m., Phil Sutter wrote:
> Hi Jamal,
> 
> On Wed, Jan 20, 2021 at 08:55:11AM -0500, Jamal Hadi Salim wrote:
>> On 2021-01-18 6:29 a.m., Phil Sutter wrote:
>>> Hi!
>>>
>>> Playing with u32 filter's hash table I noticed it is not possible to use
>>> 'sample' option with keys larger than 8bits to calculate the hash
>>> bucket.
>>
>>
>> I have mostly used something like: ht 2:: sample ip protocol 1 0xff
>> Hoping this is continuing to work.
> 
> This should read 'sample ip protocol 1 divisor 0xff', right?
> 

0xff is a mask.
The table(256 buckets) is created earlier. Something like:
filter add dev XXX parent ffff: protocol ip prio 10 handle 2:: u32 
divisor 256
This is from some scripts i have that worked. I cant see anything
that would say they will break today.


>> Reminder: you can only have 256 buckets (8 bit representation).
>> Could that be the contributing factor?
> 
> It is. Any key smaller than 256B is unaffected as no folding is done in
> either kernel or user space.
> 

Ok. I have never used it in any scenario other than 8 bits
(maybe subconsciously because of the 256 bucket limit was playing in
my head). I am not sure if Alexey at the time was thinking it is
useful for more than that.

>> Here's an example of something which is not 8 bit that i found in
>> an old script that should work (but I didnt test in current kernels).
>> ht 2:: sample u32 0x00000800 0x0000ff00 at 12
>> We are still going to extract only 8 bits for the bucket.
> 
> Yes. The resulting key is 8Bit as the low zeroes are automatically
> shifted away.
> 

ok.

>> Can you provide an example of what wouldnt work?
> 
> Sure, sorry for not including it in the original email. Let's apply
> actions to some packets based on source IP address. To efficiently
> support arbitrary numbers, we use a hash table with 256 buckets:
> 
> # tc qd add dev test0 ingress
> # tc filter add dev test0 parent ffff: prio 99 handle 1: u32 divisor 256
> # tc filter add dev test0 parent ffff: prio 1 protocol ip u32 \
> 	hashkey mask 0xffffffff at 12 link 1: match u8 0 0
> 
> So with the above in place, the kernel uses 32bits at offset 12 as a key
> to determine the bucket to jump to. This is done by just extracting the
> lowest 8bits in host byteorder, i.e. the last octet of the packet's
> source address.
> 
> Users don't know the above (and shouldn't need to), so they use sample
> to have the bucket determined automatically:
> 
> # tc filter add dev test0 parent ffff: prio 99 u32 \
> 	match ip src 10.0.0.2 \
> 	ht 1: sample ip src 10.0.0.2 divisor 256 \
> 	action drop
> 
> iproute2 calculates bucket 8 (= 10 ^ 2), while the kernel will check
> bucket 2. So the above filter will never match.
> 

Ok, makes more sense.
Is this always true though for all scenarios of key > 8b?
And is there a pattern that can be deduced?
My gut feel is user space is the right/easier spot to fix this
as long as it doesnt break the working setup of 8b.

cheers,
jamal


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-22 11:25     ` Jamal Hadi Salim
@ 2021-01-22 12:24       ` Phil Sutter
  2021-01-22 13:59       ` Phil Sutter
  1 sibling, 0 replies; 7+ messages in thread
From: Phil Sutter @ 2021-01-22 12:24 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Stephen Hemminger, netdev, Cong Wang, Jiri Pirko, Russell Stuart

Hi Jamal,

On Fri, Jan 22, 2021 at 06:25:22AM -0500, Jamal Hadi Salim wrote:
[...]
> Is this always true though for all scenarios of key > 8b?

Key size reduction algorithms simply differ, and before applying the
divisor the key is reduced to an eight bit value. If the higher bytes
are zero, the result is identical. So for some keys the differences are
irrelevant.

> And is there a pattern that can be deduced?

Something like: 'broken = !!(key >> 8)'? ;)

> My gut feel is user space is the right/easier spot to fix this
> as long as it doesnt break the working setup of 8b.

My motivation to write the initial email was that I don't like the
kernel's key folding as it's basically just cutting off the extra bits.
I am aware that fixing user space is easier, but better distribution of
entries might be worth the extra effort. Given that you didn't point out
any implications with changing u32's key folding in kernel space, I'll
just give it a try.

Thanks, Phil

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-22 11:25     ` Jamal Hadi Salim
  2021-01-22 12:24       ` Phil Sutter
@ 2021-01-22 13:59       ` Phil Sutter
  2021-01-24 13:13         ` Jamal Hadi Salim
  1 sibling, 1 reply; 7+ messages in thread
From: Phil Sutter @ 2021-01-22 13:59 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Stephen Hemminger, netdev, Cong Wang, Jiri Pirko, Russell Stuart

Jamal,

On Fri, Jan 22, 2021 at 06:25:22AM -0500, Jamal Hadi Salim wrote:
[...]
> My gut feel is user space is the right/easier spot to fix this
> as long as it doesnt break the working setup of 8b.

One last attempt at clarifying the situation:

Back in 2004, your commit 4e54c4816bf ("[NET]: Add tc extensions
infrastructure.")[1] was applied which commented out the old hash
folding and introduced the shift/cutoff we have today:

|  @@ -90,10 +101,12 @@ static struct tc_u_common *u32_list;
|  
|  static __inline__ unsigned u32_hash_fold(u32 key, struct tc_u32_sel *sel)
|  {
| -	unsigned h = key & sel->hmask;
| +	unsigned h = (key & sel->hmask)>>sel->fshift;
|  
| +	/*
|  	h ^= h>>16;
|  	h ^= h>>8;
| +	*/
|  	return h;
|  }

In a later commit, the new code was made compile-time selected via '#ifdef
fix_u32_bug'. In that same commit, I don't see a related #define though.

Do you remember why this was changed? Seems like the old code was
problematic somehow.

Cheers, Phil

[1] https://github.com/laijs/linux-kernel-ancient-history/commit/4e54c4816bfe51c145382d272b19c2ae41e9e36f#

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tc: u32: Wrong sample hash calculation
  2021-01-22 13:59       ` Phil Sutter
@ 2021-01-24 13:13         ` Jamal Hadi Salim
  0 siblings, 0 replies; 7+ messages in thread
From: Jamal Hadi Salim @ 2021-01-24 13:13 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, netdev, Cong Wang, Jiri Pirko,
	Russell Stuart

Hi Phil,

On 2021-01-22 8:59 a.m., Phil Sutter wrote:
> Jamal,
> 
> On Fri, Jan 22, 2021 at 06:25:22AM -0500, Jamal Hadi Salim wrote:
> [...]
>> My gut feel is user space is the right/easier spot to fix this
>> as long as it doesnt break the working setup of 8b.
> 
> One last attempt at clarifying the situation:
> 
> Back in 2004, your commit 4e54c4816bf ("[NET]: Add tc extensions
> infrastructure.")[1] was applied which commented out the old hash
> folding and introduced the shift/cutoff we have today:
> 
> |  @@ -90,10 +101,12 @@ static struct tc_u_common *u32_list;
> |
> |  static __inline__ unsigned u32_hash_fold(u32 key, struct tc_u32_sel *sel)
> |  {
> | -	unsigned h = key & sel->hmask;
> | +	unsigned h = (key & sel->hmask)>>sel->fshift;
> |
> | +	/*
> |  	h ^= h>>16;
> |  	h ^= h>>8;
> | +	*/
> |  	return h;
> |  }
> 
> In a later commit, the new code was made compile-time selected via '#ifdef
> fix_u32_bug'. In that same commit, I don't see a related #define though.
> 
> Do you remember why this was changed? Seems like the old code was
> problematic somehow.
> 

Vague recollection that it didnt work. I will have to dig deeper in old
email exchanges. Its been like close to 20 years (if you consider there
was work in progress about 1-2 years before that final submission) ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-01-24 13:15 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-18 11:29 tc: u32: Wrong sample hash calculation Phil Sutter
2021-01-20 13:55 ` Jamal Hadi Salim
2021-01-20 15:23   ` Phil Sutter
2021-01-22 11:25     ` Jamal Hadi Salim
2021-01-22 12:24       ` Phil Sutter
2021-01-22 13:59       ` Phil Sutter
2021-01-24 13:13         ` Jamal Hadi Salim

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.