netdev.vger.kernel.org archive mirror
* Linux Route Cache performance tests
@ 2011-11-06 15:57 Paweł Staszewski
  2011-11-06 17:29 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-06 15:57 UTC (permalink / raw)
  To: Linux Network Development list, Eric Dumazet

Hello



I ran some networking performance tests on Linux 3.1.

Configuration:

Linux (pktgen) ----> Linux (router) ----> Linux (Sink)

pktgen config:
clone_skb 32
pkt_size 64
delay 0

pgset "flag IPDST_RND"
pgset "dst_min 10.0.0.0"
pgset "dst_max 10.18.255.255"
pgset "config 1"
pgset "flows 256"
pgset "flowlen 8"
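
(pgset in the commands above is the usual shell helper from the pktgen
documentation; a minimal sketch, assuming $PGDEV is set to the pktgen
device file for the TX interface, e.g. /proc/net/pktgen/eth0:)

pgset() {
    local result
    echo $1 > $PGDEV
    result=`cat $PGDEV | fgrep "Result: OK:"`
    if [ "$result" = "" ]; then
        cat $PGDEV | fgrep Result:
    fi
}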

TX performance for this host:
eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL: 12346107.73 P/s

On Linux (router):
grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:500
/proc/sys/net/ipv4/route/error_cost:100
grep: /proc/sys/net/ipv4/route/flush: Permission denied
/proc/sys/net/ipv4/route/gc_elasticity:4
/proc/sys/net/ipv4/route/gc_interval:60
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:2000000
/proc/sys/net/ipv4/route/gc_timeout:60
/proc/sys/net/ipv4/route/max_size:8388608
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:2
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:2048

For the first ~30 seconds (maybe more) the router forwards ~5Mpps to the 
Linux (Sink); some stats for these first 30 seconds are in the attached image:

http://imageshack.us/photo/my-images/684/test1ih.png/

Top left - pktgen Linux
Bottom left - Linux router (htop)
Top right - Linux router (bwm-ng, showing pps)
Bottom right - Linux router (lnstat)


And all is good (performance stays at 5Mpps) until the Linux router reaches 
~1M route cache entries, as you can see in the next attached image:

http://imageshack.us/photo/my-images/24/test2id.png/

Forwarding performance drops from 5Mpps to 1.8Mpps, and after 3-4 minutes 
it settles at 0.7Mpps.


After flushing the route cache, performance increases from 0.7Mpps to 6Mpps, 
as you can see in the next attached image:

http://imageshack.us/photo/my-images/197/test3r.png/

Is it possible to turn off the route cache, to see what the performance 
would be without caching?


Thanks
Pawel

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 15:57 Linux Route Cache performance tests Paweł Staszewski
@ 2011-11-06 17:29 ` Eric Dumazet
  2011-11-06 18:28   ` Paweł Staszewski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-06 17:29 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Sunday, 06 November 2011 at 16:57 +0100, Paweł Staszewski wrote:
> Hello
> 
> 
> 
> I make some networking performance tests for Linux 3.1
> 
> Configuration:
> 
> Linux (pktget) ----> Linux (router) ----> Linux (Sink)
> 
> pktgen config:
> clone_skb 32
> pkt_size 64
> delay 0
> 
> pgset "flag IPDST_RND"
> pgset "dst_min 10.0.0.0"
> pgset "dst_max 10.18.255.255"
> pgset "config 1"
> pgset "flows 256"
> pgset "flowlen 8"
> 
> TX performance for this host:
> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL: 
> 12346107.73 P/s
> 
> On Linux (router):
> grep . /proc/sys/net/ipv4/route/*
> /proc/sys/net/ipv4/route/error_burst:500
> /proc/sys/net/ipv4/route/error_cost:100
> grep: /proc/sys/net/ipv4/route/flush: Permission denied
> /proc/sys/net/ipv4/route/gc_elasticity:4
> /proc/sys/net/ipv4/route/gc_interval:60
> /proc/sys/net/ipv4/route/gc_min_interval:0
> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
> /proc/sys/net/ipv4/route/gc_thresh:2000000
> /proc/sys/net/ipv4/route/gc_timeout:60
> /proc/sys/net/ipv4/route/max_size:8388608
> /proc/sys/net/ipv4/route/min_adv_mss:256
> /proc/sys/net/ipv4/route/min_pmtu:552
> /proc/sys/net/ipv4/route/mtu_expires:600
> /proc/sys/net/ipv4/route/redirect_load:2
> /proc/sys/net/ipv4/route/redirect_number:9
> /proc/sys/net/ipv4/route/redirect_silence:2048
> 
> For the first 30secs maybee more router is forwarding ~5Mpps to the 
> Linux (Sink)
> and some stats for this forst 30secs in attached image:
> 
> http://imageshack.us/photo/my-images/684/test1ih.png/
> 
> Left up - pktgen linux
> left down - Linux router (htop)
> Right up - Linux router (bwm-ng - showing pps)
> Right down - Linux router (lnstat)
> 
> 
> And all is good - performance 5Mpps until Linux router will reach ~1kk 
> entries
> What You can see on next attached image:
> 
> http://imageshack.us/photo/my-images/24/test2id.png/
> 
> Forwarding performance drops from 5Mpps to 1,8Mpps
> And after 3 - 4 minutes it will stop on 0,7Mpps
> 
> 
> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
> What You can see on next attached image:
> 
> http://imageshack.us/photo/my-images/197/test3r.png/
> 
> Is it possible to turn off route cache ? and see what performance will 
> be without caching
> 

The route cache cannot handle a DDoS situation, since it will fill up,
unless you have a lot of memory.

I am not sure what you expected here. If cache misses are too frequent,
a cache is useless, no matter how it is done.

If you disable the route cache, you'll get poor performance in the normal
situation (99.9999% of cases, non-DDoS), and the same performance in the
DDoS case (the other 0.0001%).

The trick to disable it is to use a huge rebuild_count (which ends up negative):

$ echo 3000000000 >/proc/sys/net/ipv4/rt_cache_rebuild_count
$ cat /proc/sys/net/ipv4/rt_cache_rebuild_count
-1294967296
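
(The value wraps because this sysctl is stored as a 32-bit signed integer:
3000000000 - 2^32 = 3000000000 - 4294967296 = -1294967296.)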

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 17:29 ` Eric Dumazet
@ 2011-11-06 18:28   ` Paweł Staszewski
  2011-11-06 18:48     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-06 18:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-06 18:29, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 16:57 +0100, Paweł Staszewski a écrit :
>> Hello
>>
>>
>>
>> I make some networking performance tests for Linux 3.1
>>
>> Configuration:
>>
>> Linux (pktget) ---->  Linux (router) ---->  Linux (Sink)
>>
>> pktgen config:
>> clone_skb 32
>> pkt_size 64
>> delay 0
>>
>> pgset "flag IPDST_RND"
>> pgset "dst_min 10.0.0.0"
>> pgset "dst_max 10.18.255.255"
>> pgset "config 1"
>> pgset "flows 256"
>> pgset "flowlen 8"
>>
>> TX performance for this host:
>> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL:
>> 12346107.73 P/s
>>
>> On Linux (router):
>> grep . /proc/sys/net/ipv4/route/*
>> /proc/sys/net/ipv4/route/error_burst:500
>> /proc/sys/net/ipv4/route/error_cost:100
>> grep: /proc/sys/net/ipv4/route/flush: Permission denied
>> /proc/sys/net/ipv4/route/gc_elasticity:4
>> /proc/sys/net/ipv4/route/gc_interval:60
>> /proc/sys/net/ipv4/route/gc_min_interval:0
>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>> /proc/sys/net/ipv4/route/gc_thresh:2000000
>> /proc/sys/net/ipv4/route/gc_timeout:60
>> /proc/sys/net/ipv4/route/max_size:8388608
>> /proc/sys/net/ipv4/route/min_adv_mss:256
>> /proc/sys/net/ipv4/route/min_pmtu:552
>> /proc/sys/net/ipv4/route/mtu_expires:600
>> /proc/sys/net/ipv4/route/redirect_load:2
>> /proc/sys/net/ipv4/route/redirect_number:9
>> /proc/sys/net/ipv4/route/redirect_silence:2048
>>
>> For the first 30secs maybee more router is forwarding ~5Mpps to the
>> Linux (Sink)
>> and some stats for this forst 30secs in attached image:
>>
>> http://imageshack.us/photo/my-images/684/test1ih.png/
>>
>> Left up - pktgen linux
>> left down - Linux router (htop)
>> Right up - Linux router (bwm-ng - showing pps)
>> Right down - Linux router (lnstat)
>>
>>
>> And all is good - performance 5Mpps until Linux router will reach ~1kk
>> entries
>> What You can see on next attached image:
>>
>> http://imageshack.us/photo/my-images/24/test2id.png/
>>
>> Forwarding performance drops from 5Mpps to 1,8Mpps
>> And after 3 - 4 minutes it will stop on 0,7Mpps
>>
>>
>> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
>> What You can see on next attached image:
>>
>> http://imageshack.us/photo/my-images/197/test3r.png/
>>
>> Is it possible to turn off route cache ? and see what performance will
>> be without caching
>>
> Route cache cannot handle DDOS situation, since it will be filled,
> unless you have a lot of memory.
Hmm, but what counts as a DDoS situation for the route cache? New entries 
per second? The total of ~1.2M entries in my tests?
Note that in a normal scenario you can sometimes hit
1245072 route cache entries.
This is normal for BGP configurations.

The performance of the route cache is OK up to the point where we reach 
more than 1245072 entries.
The router starts out forwarding packets at 5Mpps and ends at about 
0.7Mpps once more than 1245072 entries are reached.
In my scenario, random IP generation starts at 10.0.0.0 and ends at 
10.18.255.255 - a pool of 19 x 65536 = 1245184 possible destination IPs.

> I am not sure what you expected here. If caches misses are too frequent,
> a cache is useless, whatever how its done.
Yes, I understand this.
> If you disable route cache, you'll get poor performance in normal
> situation (99.9999% of cases, non DDOS), and same performance on DDOS,
> in 0.0001% cases
>
> Trick to disable it is to use a big (and negative) rebuild_count
>
> $ echo 3000000000>/proc/sys/net/ipv4/rt_cache_rebuild_count
> $ cat /proc/sys/net/ipv4/rt_cache_rebuild_count
> -1294967296
OK, so disabling the route cache:

echo 300000000000 > /proc/sys/net/ipv4/rt_cache_rebuild_count
cat /proc/sys/net/ipv4/rt_cache_rebuild_count
-647710720


I can reach 4Mpps forwarding performance with no degradation over time.
   /         iface                   Rx                   Tx                Total
   ==============================================================================
                lo:            0.00 P/s             0.00 P/s             0.00 P/s
              eth1:            1.00 P/s             1.00 P/s             2.00 P/s
              eth2:            0.00 P/s       3971015.09 P/s       3971015.09 P/s
              eth3:      3970941.17 P/s             0.00 P/s       3970941.17 P/s
   ------------------------------------------------------------------------------
             total:      3970942.17 P/s       3971016.09 P/s       7941958.26 P/s

lnstat -c -1 -i 1 -f rt_cache -k entries
rt_cache|
  entries|
        8|
        6|
        5|
       10|
        5|
        7|
        7|
       11|
        5|
       11|
       11|
        6|
        7|
        6|


So with the route cache disabled, performance is better once there would 
be over 1M route cache entries.
And it is stable: I now get the same performance for 10k, 50k, 100k and 
1M generated random IPs.

So yes, in a scenario where you can count on the route cache never 
exceeding ~1M entries, performance with the route cache enabled is about 
2x better than with it disabled.
But if you somehow do exceed ~1M entries in the route cache, the router 
almost stops forwarding traffic.

Thanks
Pawel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 18:28   ` Paweł Staszewski
@ 2011-11-06 18:48     ` Eric Dumazet
  2011-11-06 19:20       ` Paweł Staszewski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-06 18:48 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Sunday, 06 November 2011 at 19:28 +0100, Paweł Staszewski wrote:
> W dniu 2011-11-06 18:29, Eric Dumazet pisze:
> > Le dimanche 06 novembre 2011 à 16:57 +0100, Paweł Staszewski a écrit :
> >> Hello
> >>
> >>
> >>
> >> I make some networking performance tests for Linux 3.1
> >>
> >> Configuration:
> >>
> >> Linux (pktget) ---->  Linux (router) ---->  Linux (Sink)
> >>
> >> pktgen config:
> >> clone_skb 32
> >> pkt_size 64
> >> delay 0
> >>
> >> pgset "flag IPDST_RND"
> >> pgset "dst_min 10.0.0.0"
> >> pgset "dst_max 10.18.255.255"
> >> pgset "config 1"
> >> pgset "flows 256"
> >> pgset "flowlen 8"
> >>
> >> TX performance for this host:
> >> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL:
> >> 12346107.73 P/s
> >>
> >> On Linux (router):
> >> grep . /proc/sys/net/ipv4/route/*
> >> /proc/sys/net/ipv4/route/error_burst:500
> >> /proc/sys/net/ipv4/route/error_cost:100
> >> grep: /proc/sys/net/ipv4/route/flush: Permission denied
> >> /proc/sys/net/ipv4/route/gc_elasticity:4
> >> /proc/sys/net/ipv4/route/gc_interval:60
> >> /proc/sys/net/ipv4/route/gc_min_interval:0
> >> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
> >> /proc/sys/net/ipv4/route/gc_thresh:2000000
> >> /proc/sys/net/ipv4/route/gc_timeout:60
> >> /proc/sys/net/ipv4/route/max_size:8388608
> >> /proc/sys/net/ipv4/route/min_adv_mss:256
> >> /proc/sys/net/ipv4/route/min_pmtu:552
> >> /proc/sys/net/ipv4/route/mtu_expires:600
> >> /proc/sys/net/ipv4/route/redirect_load:2
> >> /proc/sys/net/ipv4/route/redirect_number:9
> >> /proc/sys/net/ipv4/route/redirect_silence:2048
> >>
> >> For the first 30secs maybee more router is forwarding ~5Mpps to the
> >> Linux (Sink)
> >> and some stats for this forst 30secs in attached image:
> >>
> >> http://imageshack.us/photo/my-images/684/test1ih.png/
> >>
> >> Left up - pktgen linux
> >> left down - Linux router (htop)
> >> Right up - Linux router (bwm-ng - showing pps)
> >> Right down - Linux router (lnstat)
> >>
> >>
> >> And all is good - performance 5Mpps until Linux router will reach ~1kk
> >> entries
> >> What You can see on next attached image:
> >>
> >> http://imageshack.us/photo/my-images/24/test2id.png/
> >>
> >> Forwarding performance drops from 5Mpps to 1,8Mpps
> >> And after 3 - 4 minutes it will stop on 0,7Mpps
> >>
> >>
> >> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
> >> What You can see on next attached image:
> >>
> >> http://imageshack.us/photo/my-images/197/test3r.png/
> >>
> >> Is it possible to turn off route cache ? and see what performance will
> >> be without caching
> >>
> > Route cache cannot handle DDOS situation, since it will be filled,
> > unless you have a lot of memory.
> hmm
> but what is DDOS situation for route cache ? new entries per sec ? total 
> amount of entries 1,2kk in my tests ?
> Look sometimes in normal scenario You can hit
> 1245072 route cache entries
> This is normal for BGP configurations.
> 

Then figure out the right tunables for your machine?

It's not a laptop or average server setup, so you need to allow your
kernel to consume a fair amount of memory for the route cache.

Or accept low performance :(

> The performance of route cache is ok to the point where we reach more 
> than 1245072 entries.
> Router is starting forwarding packets with 5Mpps and ends at about 
> 0.7Mpps when more than 1245072 entries is reached.
> For my scenario
> Random ip generation start at: 10.0.0.0 ends on 10.18.255.255
> this is 1170450 random ip's
> 

I have no problem with 4 million entries in the route cache, with full
performance, not 80%.


You currently have one hash table with 524288 slots
(before you changed /proc/sys/net/ipv4/route/gc_thresh).

It's not optimal for your workload: with many slots carrying chains of 4
items, performance suffers.

You have to boot your machine with "rhash_entries=2097152", so that the
average chain length is less than 1.
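
(Rough arithmetic: 1245072 cached routes / 524288 slots is an average chain
length of about 2.4, while 1245072 / 2097152 is about 0.6, i.e. below 1.)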

Your problem is then solved :

# grep . /proc/sys/net/ipv4/route/*
/proc/sys/net/ipv4/route/error_burst:5000
/proc/sys/net/ipv4/route/error_cost:1000
/proc/sys/net/ipv4/route/gc_elasticity:8
/proc/sys/net/ipv4/route/gc_min_interval:0
/proc/sys/net/ipv4/route/gc_min_interval_ms:500
/proc/sys/net/ipv4/route/gc_thresh:2097152
/proc/sys/net/ipv4/route/gc_timeout:300
/proc/sys/net/ipv4/route/max_size:33554432
/proc/sys/net/ipv4/route/min_adv_mss:256
/proc/sys/net/ipv4/route/min_pmtu:552
/proc/sys/net/ipv4/route/mtu_expires:600
/proc/sys/net/ipv4/route/redirect_load:20
/proc/sys/net/ipv4/route/redirect_number:9
/proc/sys/net/ipv4/route/redirect_silence:20480

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 18:48     ` Eric Dumazet
@ 2011-11-06 19:20       ` Paweł Staszewski
  2011-11-06 19:38         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-06 19:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-06 19:48, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 19:28 +0100, Paweł Staszewski a écrit :
>> W dniu 2011-11-06 18:29, Eric Dumazet pisze:
>>> Le dimanche 06 novembre 2011 à 16:57 +0100, Paweł Staszewski a écrit :
>>>> Hello
>>>>
>>>>
>>>>
>>>> I make some networking performance tests for Linux 3.1
>>>>
>>>> Configuration:
>>>>
>>>> Linux (pktget) ---->   Linux (router) ---->   Linux (Sink)
>>>>
>>>> pktgen config:
>>>> clone_skb 32
>>>> pkt_size 64
>>>> delay 0
>>>>
>>>> pgset "flag IPDST_RND"
>>>> pgset "dst_min 10.0.0.0"
>>>> pgset "dst_max 10.18.255.255"
>>>> pgset "config 1"
>>>> pgset "flows 256"
>>>> pgset "flowlen 8"
>>>>
>>>> TX performance for this host:
>>>> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL:
>>>> 12346107.73 P/s
>>>>
>>>> On Linux (router):
>>>> grep . /proc/sys/net/ipv4/route/*
>>>> /proc/sys/net/ipv4/route/error_burst:500
>>>> /proc/sys/net/ipv4/route/error_cost:100
>>>> grep: /proc/sys/net/ipv4/route/flush: Permission denied
>>>> /proc/sys/net/ipv4/route/gc_elasticity:4
>>>> /proc/sys/net/ipv4/route/gc_interval:60
>>>> /proc/sys/net/ipv4/route/gc_min_interval:0
>>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>>>> /proc/sys/net/ipv4/route/gc_thresh:2000000
>>>> /proc/sys/net/ipv4/route/gc_timeout:60
>>>> /proc/sys/net/ipv4/route/max_size:8388608
>>>> /proc/sys/net/ipv4/route/min_adv_mss:256
>>>> /proc/sys/net/ipv4/route/min_pmtu:552
>>>> /proc/sys/net/ipv4/route/mtu_expires:600
>>>> /proc/sys/net/ipv4/route/redirect_load:2
>>>> /proc/sys/net/ipv4/route/redirect_number:9
>>>> /proc/sys/net/ipv4/route/redirect_silence:2048
>>>>
>>>> For the first 30secs maybee more router is forwarding ~5Mpps to the
>>>> Linux (Sink)
>>>> and some stats for this forst 30secs in attached image:
>>>>
>>>> http://imageshack.us/photo/my-images/684/test1ih.png/
>>>>
>>>> Left up - pktgen linux
>>>> left down - Linux router (htop)
>>>> Right up - Linux router (bwm-ng - showing pps)
>>>> Right down - Linux router (lnstat)
>>>>
>>>>
>>>> And all is good - performance 5Mpps until Linux router will reach ~1kk
>>>> entries
>>>> What You can see on next attached image:
>>>>
>>>> http://imageshack.us/photo/my-images/24/test2id.png/
>>>>
>>>> Forwarding performance drops from 5Mpps to 1,8Mpps
>>>> And after 3 - 4 minutes it will stop on 0,7Mpps
>>>>
>>>>
>>>> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
>>>> What You can see on next attached image:
>>>>
>>>> http://imageshack.us/photo/my-images/197/test3r.png/
>>>>
>>>> Is it possible to turn off route cache ? and see what performance will
>>>> be without caching
>>>>
>>> Route cache cannot handle DDOS situation, since it will be filled,
>>> unless you have a lot of memory.
>> hmm
>> but what is DDOS situation for route cache ? new entries per sec ? total
>> amount of entries 1,2kk in my tests ?
>> Look sometimes in normal scenario You can hit
>> 1245072 route cache entries
>> This is normal for BGP configurations.
>>
> Then figure out the right tunables for your machine ?
>
> Its not a laptop or average server setup, so you need to allow your
> kernel to consume a fair amount of memory for the route cache.
Yes, these parameters were deliberately left untuned :)
To see what the route cache performance limit is.

There were no optimal parameters for this test :)
No matter what I tuned, the end result was always the same:
without sysctl tuning, performance drops from 5Mpps to 0.7Mpps.

With tuned parameters I can reach about the same as with the route cache
turned off, when running these tests.
So yes, tuned performance is better:
performance drops from 5Mpps to 0.7Mpps without tuning,
and from 5Mpps to 3.7Mpps with tuned sysctls - so a little less than with
the route cache turned off.

So the point of this test was to figure out how many route cache entries
Linux can handle without dropping performance.


> Or accept low performance :(
Never :)

>> The performance of route cache is ok to the point where we reach more
>> than 1245072 entries.
>> Router is starting forwarding packets with 5Mpps and ends at about
>> 0.7Mpps when more than 1245072 entries is reached.
>> For my scenario
>> Random ip generation start at: 10.0.0.0 ends on 10.18.255.255
>> this is 1170450 random ip's
>>
> I have no problem with 4 millions entries in route cache, with full
> performance, not 80%.
>
>
> You currently have one hash table with 524288 entries
> (before you changed /proc/sys/net/ipv4/route/gc_thresh)
>
> Its not optimal for your workload, because you have many slots with 4
> chained items, performance sucks.
>
> You have to boot your machine with "rhash_entries=2097152", so that
> average chain length is less than 1
>
> Your problem is then solved :
>
> # grep . /proc/sys/net/ipv4/route/*
> /proc/sys/net/ipv4/route/error_burst:5000
> /proc/sys/net/ipv4/route/error_cost:1000
> /proc/sys/net/ipv4/route/gc_elasticity:8
> /proc/sys/net/ipv4/route/gc_min_interval:0
> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
> /proc/sys/net/ipv4/route/gc_thresh:2097152
> /proc/sys/net/ipv4/route/gc_timeout:300
> /proc/sys/net/ipv4/route/max_size:33554432
> /proc/sys/net/ipv4/route/min_adv_mss:256
> /proc/sys/net/ipv4/route/min_pmtu:552
> /proc/sys/net/ipv4/route/mtu_expires:600
> /proc/sys/net/ipv4/route/redirect_load:20
> /proc/sys/net/ipv4/route/redirect_number:9
> /proc/sys/net/ipv4/route/redirect_silence:20480
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 19:20       ` Paweł Staszewski
@ 2011-11-06 19:38         ` Eric Dumazet
  2011-11-06 20:25           ` Paweł Staszewski
  2011-11-07 13:42           ` Ben Hutchings
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2011-11-06 19:38 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Sunday, 06 November 2011 at 20:20 +0100, Paweł Staszewski wrote:
> W dniu 2011-11-06 19:48, Eric Dumazet pisze:
> > Le dimanche 06 novembre 2011 à 19:28 +0100, Paweł Staszewski a écrit :
> >> W dniu 2011-11-06 18:29, Eric Dumazet pisze:
> >>> Le dimanche 06 novembre 2011 à 16:57 +0100, Paweł Staszewski a écrit :
> >>>> Hello
> >>>>
> >>>>
> >>>>
> >>>> I make some networking performance tests for Linux 3.1
> >>>>
> >>>> Configuration:
> >>>>
> >>>> Linux (pktget) ---->   Linux (router) ---->   Linux (Sink)
> >>>>
> >>>> pktgen config:
> >>>> clone_skb 32
> >>>> pkt_size 64
> >>>> delay 0
> >>>>
> >>>> pgset "flag IPDST_RND"
> >>>> pgset "dst_min 10.0.0.0"
> >>>> pgset "dst_max 10.18.255.255"
> >>>> pgset "config 1"
> >>>> pgset "flows 256"
> >>>> pgset "flowlen 8"
> >>>>
> >>>> TX performance for this host:
> >>>> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL:
> >>>> 12346107.73 P/s
> >>>>
> >>>> On Linux (router):
> >>>> grep . /proc/sys/net/ipv4/route/*
> >>>> /proc/sys/net/ipv4/route/error_burst:500
> >>>> /proc/sys/net/ipv4/route/error_cost:100
> >>>> grep: /proc/sys/net/ipv4/route/flush: Permission denied
> >>>> /proc/sys/net/ipv4/route/gc_elasticity:4
> >>>> /proc/sys/net/ipv4/route/gc_interval:60
> >>>> /proc/sys/net/ipv4/route/gc_min_interval:0
> >>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
> >>>> /proc/sys/net/ipv4/route/gc_thresh:2000000
> >>>> /proc/sys/net/ipv4/route/gc_timeout:60
> >>>> /proc/sys/net/ipv4/route/max_size:8388608
> >>>> /proc/sys/net/ipv4/route/min_adv_mss:256
> >>>> /proc/sys/net/ipv4/route/min_pmtu:552
> >>>> /proc/sys/net/ipv4/route/mtu_expires:600
> >>>> /proc/sys/net/ipv4/route/redirect_load:2
> >>>> /proc/sys/net/ipv4/route/redirect_number:9
> >>>> /proc/sys/net/ipv4/route/redirect_silence:2048
> >>>>
> >>>> For the first 30secs maybee more router is forwarding ~5Mpps to the
> >>>> Linux (Sink)
> >>>> and some stats for this forst 30secs in attached image:
> >>>>
> >>>> http://imageshack.us/photo/my-images/684/test1ih.png/
> >>>>
> >>>> Left up - pktgen linux
> >>>> left down - Linux router (htop)
> >>>> Right up - Linux router (bwm-ng - showing pps)
> >>>> Right down - Linux router (lnstat)
> >>>>
> >>>>
> >>>> And all is good - performance 5Mpps until Linux router will reach ~1kk
> >>>> entries
> >>>> What You can see on next attached image:
> >>>>
> >>>> http://imageshack.us/photo/my-images/24/test2id.png/
> >>>>
> >>>> Forwarding performance drops from 5Mpps to 1,8Mpps
> >>>> And after 3 - 4 minutes it will stop on 0,7Mpps
> >>>>
> >>>>
> >>>> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
> >>>> What You can see on next attached image:
> >>>>
> >>>> http://imageshack.us/photo/my-images/197/test3r.png/
> >>>>
> >>>> Is it possible to turn off route cache ? and see what performance will
> >>>> be without caching
> >>>>
> >>> Route cache cannot handle DDOS situation, since it will be filled,
> >>> unless you have a lot of memory.
> >> hmm
> >> but what is DDOS situation for route cache ? new entries per sec ? total
> >> amount of entries 1,2kk in my tests ?
> >> Look sometimes in normal scenario You can hit
> >> 1245072 route cache entries
> >> This is normal for BGP configurations.
> >>
> > Then figure out the right tunables for your machine ?
> >
> > Its not a laptop or average server setup, so you need to allow your
> > kernel to consume a fair amount of memory for the route cache.
> Yes this parameters  was special not tuned :)
> To see what is the route cache performance limit
> 

Hmm, I thought you were asking for help on netdev ?

> Because there was no optimal parameters for this test :)
> no matter what i tuned results are always the same
> performance drops from 5Mpps to 0.7Mpps without tuning sysctl
> 
> And with tuned parameters i can reach the same as turning off route 
> cache - when running this tests.
> So Yes Tuned performance is better
> performance drops from 5Mpps to 0.7Mpps - without tuning
> and from 5Mpps to 3,7Mpps with tuned sysctl - so a little less than with 
> turned off route cache
> 
> So the point of this test was figure out how much of route cache entries 
> Linux can handle without dropping performance.

No need to even run a benchmark; it's pretty easy to understand how a hash
table is handled.

Allowing long chains is not good.

With your 512k-slot hash table, you cannot expect to handle 1.4M routes
with optimal performance. End of story.

Since the route hash table is allocated at boot time, the only way to change
its size is the "rhash_entries=2097152" boot parameter.

If it still doesn't fly, try "rhash_entries=4194304".
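
For example, the parameter goes on the kernel command line in the bootloader
config; a sketch assuming GRUB legacy (the image name, root device and file
path here are only illustrative):

  kernel /boot/vmlinuz-3.1.0 root=/dev/md2 rhash_entries=2097152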

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 19:38         ` Eric Dumazet
@ 2011-11-06 20:25           ` Paweł Staszewski
  2011-11-06 21:26             ` Eric Dumazet
  2011-11-07 13:42           ` Ben Hutchings
  1 sibling, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-06 20:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-06 20:38, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 20:20 +0100, Paweł Staszewski a écrit :
>> W dniu 2011-11-06 19:48, Eric Dumazet pisze:
>>> Le dimanche 06 novembre 2011 à 19:28 +0100, Paweł Staszewski a écrit :
>>>> W dniu 2011-11-06 18:29, Eric Dumazet pisze:
>>>>> Le dimanche 06 novembre 2011 à 16:57 +0100, Paweł Staszewski a écrit :
>>>>>> Hello
>>>>>>
>>>>>>
>>>>>>
>>>>>> I make some networking performance tests for Linux 3.1
>>>>>>
>>>>>> Configuration:
>>>>>>
>>>>>> Linux (pktget) ---->    Linux (router) ---->    Linux (Sink)
>>>>>>
>>>>>> pktgen config:
>>>>>> clone_skb 32
>>>>>> pkt_size 64
>>>>>> delay 0
>>>>>>
>>>>>> pgset "flag IPDST_RND"
>>>>>> pgset "dst_min 10.0.0.0"
>>>>>> pgset "dst_max 10.18.255.255"
>>>>>> pgset "config 1"
>>>>>> pgset "flows 256"
>>>>>> pgset "flowlen 8"
>>>>>>
>>>>>> TX performance for this host:
>>>>>> eth0:            RX: 0.00 P/s      TX: 12346107.73 P/s      TOTAL:
>>>>>> 12346107.73 P/s
>>>>>>
>>>>>> On Linux (router):
>>>>>> grep . /proc/sys/net/ipv4/route/*
>>>>>> /proc/sys/net/ipv4/route/error_burst:500
>>>>>> /proc/sys/net/ipv4/route/error_cost:100
>>>>>> grep: /proc/sys/net/ipv4/route/flush: Permission denied
>>>>>> /proc/sys/net/ipv4/route/gc_elasticity:4
>>>>>> /proc/sys/net/ipv4/route/gc_interval:60
>>>>>> /proc/sys/net/ipv4/route/gc_min_interval:0
>>>>>> /proc/sys/net/ipv4/route/gc_min_interval_ms:500
>>>>>> /proc/sys/net/ipv4/route/gc_thresh:2000000
>>>>>> /proc/sys/net/ipv4/route/gc_timeout:60
>>>>>> /proc/sys/net/ipv4/route/max_size:8388608
>>>>>> /proc/sys/net/ipv4/route/min_adv_mss:256
>>>>>> /proc/sys/net/ipv4/route/min_pmtu:552
>>>>>> /proc/sys/net/ipv4/route/mtu_expires:600
>>>>>> /proc/sys/net/ipv4/route/redirect_load:2
>>>>>> /proc/sys/net/ipv4/route/redirect_number:9
>>>>>> /proc/sys/net/ipv4/route/redirect_silence:2048
>>>>>>
>>>>>> For the first 30secs maybee more router is forwarding ~5Mpps to the
>>>>>> Linux (Sink)
>>>>>> and some stats for this forst 30secs in attached image:
>>>>>>
>>>>>> http://imageshack.us/photo/my-images/684/test1ih.png/
>>>>>>
>>>>>> Left up - pktgen linux
>>>>>> left down - Linux router (htop)
>>>>>> Right up - Linux router (bwm-ng - showing pps)
>>>>>> Right down - Linux router (lnstat)
>>>>>>
>>>>>>
>>>>>> And all is good - performance 5Mpps until Linux router will reach ~1kk
>>>>>> entries
>>>>>> What You can see on next attached image:
>>>>>>
>>>>>> http://imageshack.us/photo/my-images/24/test2id.png/
>>>>>>
>>>>>> Forwarding performance drops from 5Mpps to 1,8Mpps
>>>>>> And after 3 - 4 minutes it will stop on 0,7Mpps
>>>>>>
>>>>>>
>>>>>> After flushing the route cache performance increase from 0.7Mpps to 6Mpps
>>>>>> What You can see on next attached image:
>>>>>>
>>>>>> http://imageshack.us/photo/my-images/197/test3r.png/
>>>>>>
>>>>>> Is it possible to turn off route cache ? and see what performance will
>>>>>> be without caching
>>>>>>
>>>>> Route cache cannot handle DDOS situation, since it will be filled,
>>>>> unless you have a lot of memory.
>>>> hmm
>>>> but what is DDOS situation for route cache ? new entries per sec ? total
>>>> amount of entries 1,2kk in my tests ?
>>>> Look sometimes in normal scenario You can hit
>>>> 1245072 route cache entries
>>>> This is normal for BGP configurations.
>>>>
>>> Then figure out the right tunables for your machine ?
>>>
>>> Its not a laptop or average server setup, so you need to allow your
>>> kernel to consume a fair amount of memory for the route cache.
>> Yes this parameters  was special not tuned :)
>> To see what is the route cache performance limit
>>
> Hmm, I thought you were asking for help on netdev ?
The title was "tests" :)
And yes, maybe some help - which you are giving me - in understanding how
the kernel works with and without the route cache.

>
>> Because there was no optimal parameters for this test :)
>> no matter what i tuned results are always the same
>> performance drops from 5Mpps to 0.7Mpps without tuning sysctl
>>
>> And with tuned parameters i can reach the same as turning off route
>> cache - when running this tests.
>> So Yes Tuned performance is better
>> performance drops from 5Mpps to 0.7Mpps - without tuning
>> and from 5Mpps to 3,7Mpps with tuned sysctl - so a little less than with
>> turned off route cache
>>
>> So the point of this test was figure out how much of route cache entries
>> Linux can handle without dropping performance.
> No need to even do a bench, its pretty easy to understand how a hash
> table is handled.
>
> Allowing long chains is not good.
>
> With your 512k slots hash table, you cannot expect handling 1.4M routes
> with optimal performance. End of story.
>
> Since route hash table is allocated at boot time, only way to change its
> size is using "rhash_entries=2097152" boot parameter.
>
> If it still doesnt fly, try with "rhash_entries=4194304"
Yes, there is a little problem with this, I think, with kernel 3.1:
dmesg | egrep  '(rhash)|(route)'
[    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
[    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
[    4.697294] IP route cache hash table entries: 524288 (order: 10, 4194304 bytes)


Thanks
Pawel


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 20:25           ` Paweł Staszewski
@ 2011-11-06 21:26             ` Eric Dumazet
  2011-11-06 21:57               ` Paweł Staszewski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-06 21:26 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Sunday, 06 November 2011 at 21:25 +0100, Paweł Staszewski wrote:
> Yes with this is a little problem i think with kernel 3.1 because
> dmesg | egrep  '(rhash)|(route)'
> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
> [    4.697294] IP route cache hash table entries: 524288 (order: 10, 
> 4194304 bytes)
> 
> 

Don't tell me you _still_ use a 32-bit kernel?

If so, you need to tweak alloc_large_system_hash() to use vmalloc,
because you hit the MAX_ORDER (10) page allocation limit.
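
(Rough arithmetic: each hash bucket is a single pointer, 8 bytes on x86_64,
so 2097152 buckets need 16 MB, an order-12 allocation, while order 10 is
only 4 MB - exactly the 524288-slot / 4194304-byte table your dmesg reports.)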

But considering LOWMEM is about 700 MB, you won't be able to create a
lot of route cache entries anyway.

Come on, do us a favor, and enter the new era of computing.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 21:26             ` Eric Dumazet
@ 2011-11-06 21:57               ` Paweł Staszewski
  2011-11-06 23:08                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-06 21:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-06 22:26, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 21:25 +0100, Paweł Staszewski a écrit :
>> Yes with this is a little problem i think with kernel 3.1 because
>> dmesg | egrep  '(rhash)|(route)'
>> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
>> 4194304 bytes)
>>
>>
> Dont tell me you _still_ use a 32bit kernel ?
No, it is 64-bit :)
Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
:)

> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> because you hit MAX_ORDER (10) page allocations.
Funny, then :)
Maybe I turned off too many kernel features.
> But considering LOWMEM is about 700 Mbytes, you wont be able to create a
> lot of route cache entries.
>
> Come on, do us a favor, and enter new era of computing.
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 21:57               ` Paweł Staszewski
@ 2011-11-06 23:08                 ` Eric Dumazet
  2011-11-07  8:36                   ` Paweł Staszewski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-06 23:08 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Sunday, 06 November 2011 at 22:57 +0100, Paweł Staszewski wrote:
> W dniu 2011-11-06 22:26, Eric Dumazet pisze:
> > Le dimanche 06 novembre 2011 à 21:25 +0100, Paweł Staszewski a écrit :
> >> Yes with this is a little problem i think with kernel 3.1 because
> >> dmesg | egrep  '(rhash)|(route)'
> >> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
> >> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
> >> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
> >> 4194304 bytes)
> >>
> >>
> > Dont tell me you _still_ use a 32bit kernel ?
> no it is 64bit :)
> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
> :)
> 
> > If so, you need to tweak alloc_large_system_hash() to use vmalloc,
> > because you hit MAX_ORDER (10) page allocations.
> funny then :)
> Maybee i turned off too many kernel features
> > But considering LOWMEM is about 700 Mbytes, you wont be able to create a
> > lot of route cache entries.
> >
> > Come on, do us a favor, and enter new era of computing.

OK, then your kernel is not CONFIG_NUMA enabled.

That seems strange, given you probably have a NUMA machine (24 CPUs).

If so, your choices are:

1) Enable CONFIG_NUMA. Really, this is a must given the workload of your
machine.

2) Or: add "hashdist=1" to the boot params
   and patch your kernel with the following patch:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..07f86e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
 
 int hashdist = HASHDIST_DEFAULT;
 
-#ifdef CONFIG_NUMA
 static int __init set_hashdist(char *str)
 {
 	if (!str)
@@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
 	return 1;
 }
 __setup("hashdist=", set_hashdist);
-#endif
 
 /*
  * allocate a large system hash table from bootmem

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 23:08                 ` Eric Dumazet
@ 2011-11-07  8:36                   ` Paweł Staszewski
  2011-11-07  9:08                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-07  8:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-07 00:08, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 22:57 +0100, Paweł Staszewski a écrit :
>> W dniu 2011-11-06 22:26, Eric Dumazet pisze:
>>> Le dimanche 06 novembre 2011 à 21:25 +0100, Paweł Staszewski a écrit :
>>>> Yes with this is a little problem i think with kernel 3.1 because
>>>> dmesg | egrep  '(rhash)|(route)'
>>>> [    0.000000] Command line: root=/dev/md2 rhash_entries=2097152
>>>> [    0.000000] Kernel command line: root=/dev/md2 rhash_entries=2097152
>>>> [    4.697294] IP route cache hash table entries: 524288 (order: 10,
>>>> 4194304 bytes)
>>>>
>>>>
>>> Dont tell me you _still_ use a 32bit kernel ?
>> no it is 64bit :)
>> Linux localhost 3.1.0 #16 SMP Sun Nov 6 18:09:48 CET 2011 x86_64 Intel(R)
>> :)
>>
>>> If so, you need to tweak alloc_large_system_hash() to use vmalloc,
>>> because you hit MAX_ORDER (10) page allocations.
>> funny then :)
>> Maybee i turned off too many kernel features
>>> But considering LOWMEM is about 700 Mbytes, you wont be able to create a
>>> lot of route cache entries.
>>>
>>> Come on, do us a favor, and enter new era of computing.
> OK, then your kernel is not CONFIG_NUMA enabled
>
> It seems strange given you probably have a NUMA machine (24 cpus)
Yes, NUMA was not enabled.
I ran some tests with and without NUMA to compare ixgbe performance when
using the Node= parameters of the ixgbe module.

> If so, your choices are :
>
> 1) enable CONFIG_NUMA. Really this is a must given the workload of your
> machine.
>
> 2) Or : you need to add "hashdist=1" on boot params
>     and patch your kernel with following patch :
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..07f86e0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5362,7 +5362,6 @@ int percpu_pagelist_fraction_sysctl_handler(ctl_table *table, int write,
>
>   int hashdist = HASHDIST_DEFAULT;
>
> -#ifdef CONFIG_NUMA
>   static int __init set_hashdist(char *str)
>   {
>   	if (!str)
> @@ -5371,7 +5370,6 @@ static int __init set_hashdist(char *str)
>   	return 1;
>   }
>   __setup("hashdist=", set_hashdist);
> -#endif
>
>   /*
>    * allocate a large system hash table from bootmem
>
Yes, after enabling NUMA I can change rhash_entries at kernel boot.

What matters most for a big route cache is rhash_entries: if the route
cache size exceeds the hash size, performance drops 6x to 8x.
So the best settings for the route cache are:
rhash_entries = gc_thresh = max_size
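
For example, a sketch of that rule using the 2097152 figure from earlier in
this thread (rhash_entries has to go on the kernel command line; the two
sysctls can be set at runtime):

  rhash_entries=2097152    # kernel boot parameter
  echo 2097152 > /proc/sys/net/ipv4/route/gc_thresh
  echo 2097152 > /proc/sys/net/ipv4/route/max_size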

Eric, what are the plans for removing the route cache from the kernel?
Because, as you can see, performance is better with the route cache; without
it, performance is not as good as with the cache enabled, but it is stable
in all situations, even a DDoS with 10M random IPs.

So for the future, do we need to prepare for lower kernel IP forwarding
performance because of no route cache?
Or will removing the route cache save some time in IP stack processing?


Thanks
Pawel



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-07  8:36                   ` Paweł Staszewski
@ 2011-11-07  9:08                     ` Eric Dumazet
  2011-11-07  9:16                       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-07  9:08 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 07 November 2011 at 09:36 +0100, Paweł Staszewski wrote:

> Yes after enabling NUMA I can change rhash_entries on kernel boot.
> 
> And what is the most important for big route cahce is rhash_entries
> if route cache size exceed hash size performance will drop 6x to 8x
> So the best settings for route cache are:
> rhash_entries = gc_thresh = max_size
> 
> Eric tell me what are the plans for removing route cache from kernel ?
> Because as You see with route cache performance is better
> And without route cache performance is not soo good than with route 
> cache enabled but it is stable for all situations even DDOS with 10kk 
> random_ips
> 
> So for the feature we need to prepare for lower kernel IP forwarding 
> performance because of no route cache ?
> Or removing route cache will save some time in IP stack  processing ?
> 

Obviously, cache removal will only be possible once performance without
it is the same.

Work is in progress; it started a long time ago.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-07  9:08                     ` Eric Dumazet
@ 2011-11-07  9:16                       ` Eric Dumazet
  2011-11-07 22:12                         ` Paweł Staszewski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-07  9:16 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Monday, 07 November 2011 at 10:08 +0100, Eric Dumazet wrote:

> Obviously, cache removal will be possible only when performance without
> it is the same.
> 
> Work is in progress, it started a long time ago.
> 

One of the reasons to get rid of this cache is its memory use.

256 bytes per entry is a lot of memory if you need 2,000,000
entries...
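
(256 bytes x 2,000,000 entries comes to roughly 512 MB for the cached
routes alone, on top of the hash table itself.)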

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-06 19:38         ` Eric Dumazet
  2011-11-06 20:25           ` Paweł Staszewski
@ 2011-11-07 13:42           ` Ben Hutchings
  2011-11-07 14:33             ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Ben Hutchings @ 2011-11-07 13:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Paweł Staszewski, Linux Network Development list

On Sun, 2011-11-06 at 20:38 +0100, Eric Dumazet wrote:
> Le dimanche 06 novembre 2011 à 20:20 +0100, Paweł Staszewski a écrit :
[...]
> > So the point of this test was figure out how much of route cache entries 
> > Linux can handle without dropping performance.
> 
> No need to even do a bench, its pretty easy to understand how a hash
> table is handled.
> 
> Allowing long chains is not good.
> 
> With your 512k slots hash table, you cannot expect handling 1.4M routes
> with optimal performance. End of story.
> 
> Since route hash table is allocated at boot time, only way to change its
> size is using "rhash_entries=2097152" boot parameter.
> 
> If it still doesnt fly, try with "rhash_entries=4194304"

A routing cache this big is not going to fit in the processor caches,
anyway; in fact even the hash table may not.  So a routing cache hit is
likely to involve processor cache misses.  After David's work to make
cacheless operation faster, I suspect that such a 'hit' can be a net
loss.  But it *is* necessary to run a benchmark to answer this (and the
answer will obviously vary between systems).

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-07 13:42           ` Ben Hutchings
@ 2011-11-07 14:33             ` Eric Dumazet
  2011-11-09 17:24               ` [PATCH net-next] ipv4: PKTINFO doesnt need dst reference Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-07 14:33 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Paweł Staszewski, Linux Network Development list

On Monday, 07 November 2011 at 13:42 +0000, Ben Hutchings wrote:

> A routing cache this big is not going to fit in the processor caches,
> anyway; in fact even the hash table may not.  So a routing cache hit is
> likely to involve processor cache misses.  After David's work to make
> cacheless operation faster, I suspect that such a 'hit' can be a net
> loss.  But it *is* necessary to run a benchmark to answer this (and the
> answer will obviously vary between systems).
> 

I don't know why you think the full hash table should fit in the processor
cache. If it does, that's perfect, but it's not a requirement.

It is one cache miss to get the pointer to the first element in the
chain. Of course this might be a cache hit if several packets for a
given flow are processed in a short period of time.

Given that a dst itself is 256 bytes (4 cache lines), one extra cache miss
to get the pointer to the dst is not very expensive.

At least, in recent kernels we don't change dst->refcnt in the forwarding
path (using NOREF skb->dst).

One particular point is the atomic_inc(dst->refcnt) we have to perform
when queuing a UDP packet if the socket asked for PKTINFO (for example, a
typical DNS server has to set this option).

I have one patch somewhere that stores the information in skb->cb[] and
avoids the atomic_{inc|dec}(dst->refcnt).

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Linux Route Cache performance tests
  2011-11-07  9:16                       ` Eric Dumazet
@ 2011-11-07 22:12                         ` Paweł Staszewski
  0 siblings, 0 replies; 32+ messages in thread
From: Paweł Staszewski @ 2011-11-07 22:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

On 2011-11-07 10:16, Eric Dumazet wrote:
> Le lundi 07 novembre 2011 à 10:08 +0100, Eric Dumazet a écrit :
>
>> Obviously, cache removal will be possible only when performance without
>> it is the same.
>>
>> Work is in progress, it started a long time ago.
>>
> One of the reason to get rid of this cache is its memory use.
>
> 256 bytes per entry, thats a lot of memory if you need 2.000.000
> entries...
>
Yes, it is a lot for small embedded systems.
But these days, when many systems have 12 / 24 / 48 GB of memory, it is
not too much.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH net-next] ipv4: PKTINFO doesnt need dst reference
  2011-11-07 14:33             ` Eric Dumazet
@ 2011-11-09 17:24               ` Eric Dumazet
  2011-11-09 21:37                 ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-09 17:24 UTC (permalink / raw)
  To: Ben Hutchings, David Miller
  Cc: Paweł Staszewski, Linux Network Development list

On Monday, 07 November 2011 at 15:33 +0100, Eric Dumazet wrote:

> At least, in recent kernels we dont change dst->refcnt in forwarding
> patch (usinf NOREF skb->dst)
> 
> One particular point is the atomic_inc(dst->refcnt) we have to perform
> when queuing an UDP packet if socket asked PKTINFO stuff (for example a
> typical DNS server has to setup this option)
> 
> I have one patch somewhere that stores the information in skb->cb[] and
> avoid the atomic_{inc|dec}(dst->refcnt).
> 

OK, I found it. I did some extra tests and believe it's ready.

[PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference

When a socket uses IP_PKTINFO notifications, we currently force a dst
reference for each received skb. Reader has to access dst to get needed
information (rt_iif & rt_spec_dst) and must release dst reference.

We also forced a dst reference if skb was put in socket backlog, even
without IP_PKTINFO handling. This happens under stress/load.

We can instead store the needed information in skb->cb[], so that only
softirq handler really access dst, improving cache hit ratios.

This removes two atomic operations per packet, and false sharing as
well.

On a benchmark using a mono threaded receiver (doing only recvmsg()
calls), I can reach 720.000 pps instead of 570.000 pps.

IP_PKTINFO is typically used by DNS servers, and any multihomed aware
UDP application.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/ip.h       |    2 +-
 net/ipv4/ip_sockglue.c |   37 +++++++++++++++++++------------------
 net/ipv4/raw.c         |    3 ++-
 net/ipv4/udp.c         |    3 ++-
 net/ipv6/raw.c         |    3 ++-
 net/ipv6/udp.c         |    4 +++-
 6 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index eca0ef7..fd1561e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -450,7 +450,7 @@ extern int ip_options_rcv_srr(struct sk_buff *skb);
  *	Functions provided by ip_sockglue.c
  */
 
-extern int	ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
+extern void	ipv4_pktinfo_prepare(struct sk_buff *skb);
 extern void	ip_cmsg_recv(struct msghdr *msg, struct sk_buff *skb);
 extern int	ip_cmsg_send(struct net *net,
 			     struct msghdr *msg, struct ipcm_cookie *ipc);
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 09ff51b..b516030 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -55,20 +55,13 @@
 /*
  *	SOL_IP control messages.
  */
+#define PKTINFO_SKB_CB(__skb) ((struct in_pktinfo *)((__skb)->cb))
 
 static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
 {
-	struct in_pktinfo info;
-	struct rtable *rt = skb_rtable(skb);
-
+	struct in_pktinfo info = *PKTINFO_SKB_CB(skb);
+		
 	info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
-	if (rt) {
-		info.ipi_ifindex = rt->rt_iif;
-		info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
-	} else {
-		info.ipi_ifindex = 0;
-		info.ipi_spec_dst.s_addr = 0;
-	}
 
 	put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
 }
@@ -992,20 +985,28 @@ e_inval:
 }
 
 /**
- * ip_queue_rcv_skb - Queue an skb into sock receive queue
+ * ipv4_pktinfo_prepare - transfert some info from rtable to skb
  * @sk: socket
  * @skb: buffer
  *
- * Queues an skb into socket receive queue. If IP_CMSG_PKTINFO option
- * is not set, we drop skb dst entry now, while dst cache line is hot.
+ * To support IP_CMSG_PKTINFO option, we store rt_iif and rt_spec_dst
+ * in skb->cb[] before dst drop.
+ * This way, receiver doesnt make cache line misses to read rtable.
  */
-int ip_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
+void ipv4_pktinfo_prepare(struct sk_buff *skb)
 {
-	if (!(inet_sk(sk)->cmsg_flags & IP_CMSG_PKTINFO))
-		skb_dst_drop(skb);
-	return sock_queue_rcv_skb(sk, skb);
+	struct in_pktinfo *pktinfo = PKTINFO_SKB_CB(skb);
+	const struct rtable *rt = skb_rtable(skb);
+
+	if (rt) {
+		pktinfo->ipi_ifindex = rt->rt_iif;
+		pktinfo->ipi_spec_dst.s_addr = rt->rt_spec_dst;
+	} else {
+		pktinfo->ipi_ifindex = 0;
+		pktinfo->ipi_spec_dst.s_addr = 0;
+	}
+	skb_dst_drop(skb);
 }
-EXPORT_SYMBOL(ip_queue_rcv_skb);
 
 int ip_setsockopt(struct sock *sk, int level,
 		int optname, char __user *optval, unsigned int optlen)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 007e2eb..7a8410d 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -292,7 +292,8 @@ static int raw_rcv_skb(struct sock * sk, struct sk_buff * skb)
 {
 	/* Charge it to the socket. */
 
-	if (ip_queue_rcv_skb(sk, skb) < 0) {
+	ipv4_pktinfo_prepare(skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ab0966d..6854f58 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1357,7 +1357,7 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (inet_sk(sk)->inet_daddr)
 		sock_rps_save_rxhash(sk, skb);
 
-	rc = ip_queue_rcv_skb(sk, skb);
+	rc = sock_queue_rcv_skb(sk, skb);
 	if (rc < 0) {
 		int is_udplite = IS_UDPLITE(sk);
 
@@ -1473,6 +1473,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	rc = 0;
 
+	ipv4_pktinfo_prepare(skb);
 	bh_lock_sock(sk);
 	if (!sock_owned_by_user(sk))
 		rc = __udp_queue_rcv_skb(sk, skb);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 331af3b..204f2e8 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -383,7 +383,8 @@ static inline int rawv6_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	}
 
 	/* Charge it to the socket. */
-	if (ip_queue_rcv_skb(sk, skb) < 0) {
+	skb_dst_drop(skb);
+	if (sock_queue_rcv_skb(sk, skb) < 0) {
 		kfree_skb(skb);
 		return NET_RX_DROP;
 	}
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 846f475..b4a4a15 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -538,7 +538,9 @@ int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	if ((rc = ip_queue_rcv_skb(sk, skb)) < 0) {
+	skb_dst_drop(skb);
+	rc = sock_queue_rcv_skb(sk, skb);
+	if (rc < 0) {
 		/* Note that an ENOMEM error is charged twice */
 		if (rc == -ENOMEM)
 			UDP6_INC_STATS_BH(sock_net(sk),

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] ipv4: PKTINFO doesnt need dst reference
  2011-11-09 17:24               ` [PATCH net-next] ipv4: PKTINFO doesnt need dst reference Eric Dumazet
@ 2011-11-09 21:37                 ` David Miller
  2011-11-09 22:03                   ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2011-11-09 21:37 UTC (permalink / raw)
  To: eric.dumazet; +Cc: bhutchings, pstaszewski, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 09 Nov 2011 18:24:35 +0100

> [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> 
> When a socket uses IP_PKTINFO notifications, we currently force a dst
> reference for each received skb. Reader has to access dst to get needed
> information (rt_iif & rt_spec_dst) and must release dst reference.
> 
> We also forced a dst reference if skb was put in socket backlog, even
> without IP_PKTINFO handling. This happens under stress/load.
> 
> We can instead store the needed information in skb->cb[], so that only
> softirq handler really access dst, improving cache hit ratios.
> 
> This removes two atomic operations per packet, and false sharing as
> well.
> 
> On a benchmark using a mono threaded receiver (doing only recvmsg()
> calls), I can reach 720.000 pps instead of 570.000 pps.
> 
> IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> UDP application.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Looks good, if it compiles I'll push it out to net-next :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] ipv4: PKTINFO doesnt need dst reference
  2011-11-09 21:37                 ` David Miller
@ 2011-11-09 22:03                   ` Eric Dumazet
  2011-11-10  0:29                     ` [PATCH net-next] bnx2x: reduce skb truesize by 50% Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-09 22:03 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, pstaszewski, netdev

On Wednesday 09 November 2011 at 16:37 -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 09 Nov 2011 18:24:35 +0100
> 
> > [PATCH net-next] ipv4: IP_PKTINFO doesnt need dst reference
> > 
> > When a socket uses IP_PKTINFO notifications, we currently force a dst
> > reference for each received skb. Reader has to access dst to get needed
> > information (rt_iif & rt_spec_dst) and must release dst reference.
> > 
> > We also forced a dst reference if skb was put in socket backlog, even
> > without IP_PKTINFO handling. This happens under stress/load.
> > 
> > We can instead store the needed information in skb->cb[], so that only
> > softirq handler really access dst, improving cache hit ratios.
> > 
> > This removes two atomic operations per packet, and false sharing as
> > well.
> > 
> > On a benchmark using a mono threaded receiver (doing only recvmsg()
> > calls), I can reach 720.000 pps instead of 570.000 pps.
> > 
> > IP_PKTINFO is typically used by DNS servers, and any multihomed aware
> > UDP application.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Looks good, if it compiles I'll push it out to net-next :-)

Argh :( I'll cross my fingers :)

BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
bytes:

skb->truesize=4352 len=26 (payload only)

Now that truesize is more precise, we hit the shared udp_memory_allocated
badly, even with single frames.

I wonder if we shouldn't increase SK_MEM_QUANTUM a bit to avoid
ping/pong...

-#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
+#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
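
To make the ping/pong concern concrete: each socket forward-charges
receive memory in SK_MEM_QUANTUM units, so with a truesize of 4352 and
a 4096-byte quantum nearly every small datagram has to touch the shared
udp_memory_allocated counter, while an 8192-byte quantum spreads one
grab over more than one packet. The rough model below is a
simplification of the real __sk_mem_schedule() accounting (it ignores
uncharging on consumption), just to show the effect of the quantum size.

/* Rough model of per-packet charging against the shared UDP memory
 * counter.  Each socket keeps a small forward allocation measured in
 * SK_MEM_QUANTUM units; when a packet does not fit, the socket has to
 * go back to the global counter.  This is a simplification of the
 * real accounting, only meant to show the effect of the quantum. */
#include <stdio.h>

static int global_touches(int truesize, int quantum, int npackets)
{
	int forward = 0;   /* bytes still available from the last grab */
	int touches = 0;   /* times we had to hit the shared counter   */

	for (int i = 0; i < npackets; i++) {
		if (forward < truesize) {
			/* grab enough quanta from the shared counter */
			int need = truesize - forward;
			int quanta = (need + quantum - 1) / quantum;

			forward += quanta * quantum;
			touches++;
		}
		forward -= truesize;
	}
	return touches;
}

int main(void)
{
	const int truesize = 4352;      /* figure quoted in the mail */
	const int n = 1000;

	printf("PAGE_SIZE quantum:   %d global updates / %d packets\n",
	       global_touches(truesize, 4096, n), n);
	printf("2*PAGE_SIZE quantum: %d global updates / %d packets\n",
	       global_touches(truesize, 8192, n), n);
	return 0;
}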

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-09 22:03                   ` Eric Dumazet
@ 2011-11-10  0:29                     ` Eric Dumazet
  2011-11-10 15:05                       ` Eilon Greenstein
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-10  0:29 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, pstaszewski, netdev, Eilon Greenstein

On Wednesday 09 November 2011 at 23:03 +0100, Eric Dumazet wrote:

> BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
> bytes :
> 
> skb->truesize=4352 len=26 (payload only)
> 

> I wonder if we shouldnt increase SK_MEM_QUANTUM a bit to avoid
> ping/pong...
> 
> -#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
> +#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
> 

The following patch also helps a lot, even with only two CPUs (one handling
device interrupts, one running the application thread).

[PATCH net-next] bnx2x: reduce skb truesize by ~50%

bnx2x uses the following formula to compute its rx_buf_sz:

dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8

Then the core network stack adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
skb_shared_info)).

Final allocated size for the skb head on x86_64 (L1_CACHE_BYTES = 64,
MTU=1500): 2112 bytes. SLUB/SLAB rounds this up to 4096 bytes.

Since skb truesize is then bigger than SK_MEM_QUANTUM, we get a lot of
false sharing because of mem_reclaim in the UDP stack.

One possible way to halve truesize is to lower the needed size by 64 bytes (2112
-> 2048 bytes).

This way, skb->truesize is lower than SK_MEM_QUANTUM and we get better
performance.

(760.000 pps on a mono-threaded UDP rx benchmark, instead of 720.000 pps)


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index aec7212..ebbdc55 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1185,9 +1185,14 @@ struct bnx2x {
 #define ETH_MAX_PACKET_SIZE		1500
 #define ETH_MAX_JUMBO_PACKET_SIZE	9600
 
-	/* Max supported alignment is 256 (8 shift) */
-#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
-					 L1_CACHE_SHIFT : 8)
+/* Max supported alignment is 256 (8 shift)
+ * It should ideally be min(L1_CACHE_SHIFT, 8)
+ * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
+ * instead of 4096 bytes.
+ * With SLUB/SLAB allocators, data will be cache line aligned anyway.
+ */
+#define BNX2X_RX_ALIGN_SHIFT		5
+
 	/* FW use 2 Cache lines Alignment for start packet and size  */
 #define BNX2X_FW_RX_ALIGN		(2 << BNX2X_RX_ALIGN_SHIFT)
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
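
The sizing in this changelog can be reproduced with a few lines of
arithmetic. In the sketch below, NET_SKB_PAD = 64, the 64-byte rounding
of the data area and a 384-byte aligned skb_shared_info are assumptions
chosen to reproduce the 2112 -> 2048 figures quoted above for x86_64 of
that era; the MTU and per-packet overhead terms come straight from the
formula in the changelog.

/* Worked sizing arithmetic for the bnx2x rx buffer, as described in
 * the changelog above.  NET_SKB_PAD = 64, the 64-byte rounding and a
 * 384-byte aligned skb_shared_info are assumptions for x86_64 of that
 * era; the MTU/overhead terms come straight from the mail. */
#include <stdio.h>

#define ALIGN_UP(x, a)   (((x) + (a) - 1) & ~((a) - 1))

static unsigned int kmalloc_bucket(unsigned int sz)
{
	unsigned int b = 32;

	while (b < sz)          /* power-of-two slab caches */
		b <<= 1;
	return b;
}

static void show(const char *tag, unsigned int fw_align)
{
	unsigned int mtu = 1500;
	unsigned int rx_buf_sz = mtu + fw_align + 14 + 8 + 8;
	unsigned int net_skb_pad = 64;                 /* assumed */
	unsigned int shinfo_aligned = 384;             /* assumed */
	unsigned int head = ALIGN_UP(rx_buf_sz + net_skb_pad, 64)
			    + shinfo_aligned;

	printf("%s: rx_buf_sz=%u head=%u -> kmalloc-%u\n",
	       tag, rx_buf_sz, head, kmalloc_bucket(head));
}

int main(void)
{
	show("before (2 * 64B align)", 2 * 64);  /* -> 2112, kmalloc-4096 */
	show("after  (2 * 32B align)", 2 * 32);  /* -> 2048, kmalloc-2048 */
	return 0;
}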

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-10  0:29                     ` [PATCH net-next] bnx2x: reduce skb truesize by 50% Eric Dumazet
@ 2011-11-10 15:05                       ` Eilon Greenstein
  2011-11-10 15:27                         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eilon Greenstein @ 2011-11-10 15:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Wed, 2011-11-09 at 16:29 -0800, Eric Dumazet wrote:
> On Wednesday 09 November 2011 at 23:03 +0100, Eric Dumazet wrote:
> 
> > BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
> > bytes :
> > 
> > skb->truesize=4352 len=26 (payload only)
> > 
> 
> > I wonder if we shouldnt increase SK_MEM_QUANTUM a bit to avoid
> > ping/pong...
> > 
> > -#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
> > +#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
> > 
> 
> Following patch also helps a lot, even with only two cpus (one handling
> device interrupts, one running the application thread)
> 
> [PATCH net-next] bnx2x: reduce skb truesize by ~50%
> 
> bnx2x uses following formula to compute its rx_buf_sz :
> 
> dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8
> 
> Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info))
> 
> Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> 
> Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> false sharing because of mem_reclaim in UDP stack.
> 
> One possible way to half truesize is to lower the need by 64 bytes (2112
> -> 2048 bytes)
> 
> This way, skb->truesize is lower than SK_MEM_QUANTUM and we get better
> performance.
> 
> (760.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)
> 
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Eilon Greenstein <eilong@broadcom.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> index aec7212..ebbdc55 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> @@ -1185,9 +1185,14 @@ struct bnx2x {
>  #define ETH_MAX_PACKET_SIZE		1500
>  #define ETH_MAX_JUMBO_PACKET_SIZE	9600
>  
> -	/* Max supported alignment is 256 (8 shift) */
> -#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
> -					 L1_CACHE_SHIFT : 8)
> +/* Max supported alignment is 256 (8 shift)
> + * It should ideally be min(L1_CACHE_SHIFT, 8)
> + * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
> + * instead of 4096 bytes.
> + * With SLUB/SLAB allocators, data will be cache line aligned anyway.
> + */
> +#define BNX2X_RX_ALIGN_SHIFT		5
> +

Hi Eric,

This can seriously hurt PCI utilization, so in scenarios where PCI is
the bottleneck you will see performance degradation. We are
looking at alternatives to reduce the allocation, but it is taking a
while. Please hold off on this patch.

Thanks,
Eilon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-10 15:05                       ` Eilon Greenstein
@ 2011-11-10 15:27                         ` Eric Dumazet
  2011-11-10 16:27                           ` Eilon Greenstein
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-10 15:27 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Thursday 10 November 2011 at 17:05 +0200, Eilon Greenstein wrote:
> On Wed, 2011-11-09 at 16:29 -0800, Eric Dumazet wrote:
> > On Wednesday 09 November 2011 at 23:03 +0100, Eric Dumazet wrote:
> > 
> > > BTW, on my bnx2x adapter, even small UDP frames use more than PAGE_SIZE
> > > bytes :
> > > 
> > > skb->truesize=4352 len=26 (payload only)
> > > 
> > 
> > > I wonder if we shouldnt increase SK_MEM_QUANTUM a bit to avoid
> > > ping/pong...
> > > 
> > > -#define SK_MEM_QUANTUM ((int)PAGE_SIZE)
> > > +#define SK_MEM_QUANTUM ((int)PAGE_SIZE * 2)
> > > 
> > 
> > Following patch also helps a lot, even with only two cpus (one handling
> > device interrupts, one running the application thread)
> > 
> > [PATCH net-next] bnx2x: reduce skb truesize by ~50%
> > 
> > bnx2x uses following formula to compute its rx_buf_sz :
> > 
> > dev->mtu + 2*L1_CACHE_BYTES + 14 + 8 + 8
> > 
> > Then core network adds NET_SKB_PAD and SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info))
> > 
> > Final allocated size for skb head on x86_64 (L1_CACHE_BYTES = 64,
> > MTU=1500) : 2112 bytes : SLUB/SLAB round this to 4096 bytes.
> > 
> > Since skb truesize is then bigger than SK_MEM_QUANTUM, we have lot of
> > false sharing because of mem_reclaim in UDP stack.
> > 
> > One possible way to half truesize is to lower the need by 64 bytes (2112
> > -> 2048 bytes)
> > 
> > This way, skb->truesize is lower than SK_MEM_QUANTUM and we get better
> > performance.
> > 
> > (760.000 pps on a rx UDP monothread benchmark, instead of 720.000 pps)
> > 
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Eilon Greenstein <eilong@broadcom.com>
> > ---
> >  drivers/net/ethernet/broadcom/bnx2x/bnx2x.h |   11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > index aec7212..ebbdc55 100644
> > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > @@ -1185,9 +1185,14 @@ struct bnx2x {
> >  #define ETH_MAX_PACKET_SIZE		1500
> >  #define ETH_MAX_JUMBO_PACKET_SIZE	9600
> >  
> > -	/* Max supported alignment is 256 (8 shift) */
> > -#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
> > -					 L1_CACHE_SHIFT : 8)
> > +/* Max supported alignment is 256 (8 shift)
> > + * It should ideally be min(L1_CACHE_SHIFT, 8)
> > + * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
> > + * instead of 4096 bytes.
> > + * With SLUB/SLAB allocators, data will be cache line aligned anyway.
> > + */
> > +#define BNX2X_RX_ALIGN_SHIFT		5
> > +
> 
> Hi Eric,
> 
> This can seriously hurt the PCI utilization. So in scenarios in which
> the PCI is the bottle neck, you will see performance degradation. We are
> looking at alternatives to reduce the allocation, but it is taking a
> while. Please hold off with this patch.

What do you mean exactly?

This patch doesn't change skb->data alignment; it is still 64-byte
aligned (cqe_fp->placement_offset == 2). PCI utilization is the same.

Only SLOB could get a misalignment, but who uses SLOB for performance?

An alternative would be to check why the hardware needs 2*L1_CACHE_BYTES of
extra room for alignment... Normally it could be 1*L1_CACHE_BYTES?

 	/* FW use 2 Cache lines Alignment for start packet and size  */
-#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
+#define BNX2X_FW_RX_ALIGN              (1 << BNX2X_RX_ALIGN_SHIFT)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-10 15:27                         ` Eric Dumazet
@ 2011-11-10 16:27                           ` Eilon Greenstein
  2011-11-10 16:45                             ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eilon Greenstein @ 2011-11-10 16:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Thu, 2011-11-10 at 07:27 -0800, Eric Dumazet wrote:
> Le jeudi 10 novembre 2011 à 17:05 +0200, Eilon Greenstein a écrit :
> > > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > @@ -1185,9 +1185,14 @@ struct bnx2x {
> > >  #define ETH_MAX_PACKET_SIZE		1500
> > >  #define ETH_MAX_JUMBO_PACKET_SIZE	9600
> > >  
> > > -	/* Max supported alignment is 256 (8 shift) */
> > > -#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
> > > -					 L1_CACHE_SHIFT : 8)
> > > +/* Max supported alignment is 256 (8 shift)
> > > + * It should ideally be min(L1_CACHE_SHIFT, 8)
> > > + * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
> > > + * instead of 4096 bytes.
> > > + * With SLUB/SLAB allocators, data will be cache line aligned anyway.
> > > + */
> > > +#define BNX2X_RX_ALIGN_SHIFT		5
> > > +
> > 
> > Hi Eric,
> > 
> > This can seriously hurt the PCI utilization. So in scenarios in which
> > the PCI is the bottle neck, you will see performance degradation. We are
> > looking at alternatives to reduce the allocation, but it is taking a
> > while. Please hold off with this patch.
> 
> What do you mean exactly ?
> 
> This patch doesnt change skb->data alignment, its still 64 bytes
> aligned. (cqe_fp->placement_offset == 2). PCI utilization is the same.
> 
> Only SLOB could get a misalignement, but who uses SLOB for performance ?

Obviously you are right... But the FW is configured to the wrong
alignment, and that will affect the end alignment (padding), which is
significant in small-packet scenarios where PCI is the bottleneck.

> Alternative would be to check why hardware need 2*L1_CACHE_BYTES extra
> room for alignment... Normaly it could be 1*L1_CACHE_BYTES ?

Again - you are a mind reader :) This is what we are looking into right
now. The problem is that if the buffer is not aligned (SLOB), configuring
the FW to align can overstep the allocated boundaries.

>  	/* FW use 2 Cache lines Alignment for start packet and size  */
> -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> +#define BNX2X_FW_RX_ALIGN              (1 << BNX2X_RX_ALIGN_SHIFT)
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-10 16:27                           ` Eilon Greenstein
@ 2011-11-10 16:45                             ` Eric Dumazet
  2011-11-13 18:53                               ` Eilon Greenstein
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-10 16:45 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Thursday 10 November 2011 at 18:27 +0200, Eilon Greenstein wrote:
> On Thu, 2011-11-10 at 07:27 -0800, Eric Dumazet wrote:
> > On Thursday 10 November 2011 at 17:05 +0200, Eilon Greenstein wrote:
> > > > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > > @@ -1185,9 +1185,14 @@ struct bnx2x {
> > > >  #define ETH_MAX_PACKET_SIZE		1500
> > > >  #define ETH_MAX_JUMBO_PACKET_SIZE	9600
> > > >  
> > > > -	/* Max supported alignment is 256 (8 shift) */
> > > > -#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
> > > > -					 L1_CACHE_SHIFT : 8)
> > > > +/* Max supported alignment is 256 (8 shift)
> > > > + * It should ideally be min(L1_CACHE_SHIFT, 8)
> > > > + * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
> > > > + * instead of 4096 bytes.
> > > > + * With SLUB/SLAB allocators, data will be cache line aligned anyway.
> > > > + */
> > > > +#define BNX2X_RX_ALIGN_SHIFT		5
> > > > +
> > > 
> > > Hi Eric,
> > > 
> > > This can seriously hurt the PCI utilization. So in scenarios in which
> > > the PCI is the bottle neck, you will see performance degradation. We are
> > > looking at alternatives to reduce the allocation, but it is taking a
> > > while. Please hold off with this patch.
> > 
> > What do you mean exactly ?
> > 
> > This patch doesnt change skb->data alignment, its still 64 bytes
> > aligned. (cqe_fp->placement_offset == 2). PCI utilization is the same.
> > 
> > Only SLOB could get a misalignement, but who uses SLOB for performance ?
> 
> Obviously you are right... But the FW is configured to the wrong
> alignment and that will affect the end alignment (padding) which is
> significant in small packets scenarios where the PCI is the bottle neck.

Yes, I fully understand.

> 
> > Alternative would be to check why hardware need 2*L1_CACHE_BYTES extra
> > room for alignment... Normaly it could be 1*L1_CACHE_BYTES ?
> 
> Again - you are a mind reader :) This is what we are looking into right
> now. The problem is that `if` the buffer is not aligned (SLOB) we can
> overstep the allocated boundaries by configuring the FW to align.
> 
> >  	/* FW use 2 Cache lines Alignment for start packet and size  */
> > -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> > +#define BNX2X_FW_RX_ALIGN              (1 << BNX2X_RX_ALIGN_SHIFT)
> > 
> > 

I did a SLOB test (with my patch included as well):

skb->len=66 pad=26 skb->data=0xffff8801194da048 truesize=2304

So skb->data + pad -> 0xffff8801194da062: a 32-byte alignment + 2
bytes to align the IP header. (BTW we don't really need those 2 bytes;
NET_IP_ALIGN is now 0 on most x86 platforms?)

In the end, we get 98 bytes of 'skb reserve', and also 64 bytes of extra
headroom _after_ the end of the full frame.

In my understanding, the hardware alignment should be between 0 and 63, not
0 and 127.

So maybe it is just that BNX2X_FW_RX_ALIGN is twice the needed amount.
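
The alignment observation above can be checked with a couple of lines
of arithmetic on the numbers from the SLOB run; the 2-byte shift is the
IP-header alignment mentioned in the mail, everything else is plain
modulo arithmetic.

/* Quick check of the alignment arithmetic above, using the address
 * and pad printed in the SLOB run.  The 2-byte offset is the
 * IP-header shift mentioned in the mail; the rest is plain modulo
 * arithmetic. */
#include <stdio.h>

int main(void)
{
	unsigned long long data = 0xffff8801194da048ULL; /* skb->data */
	unsigned int pad = 26;                           /* placement */
	unsigned long long start = data + pad;

	printf("frame start   = %#llx\n", start);
	printf("start mod 64  = %llu\n", start % 64);  /* prints 34 */
	printf("start mod 32  = %llu\n", start % 32);  /* prints 2  */
	/* A 32-byte boundary plus 2 bytes, not a 64-byte boundary,
	 * matching the observation above. */
	return 0;
}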

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-10 16:45                             ` Eric Dumazet
@ 2011-11-13 18:53                               ` Eilon Greenstein
  2011-11-13 19:42                                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eilon Greenstein @ 2011-11-13 18:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Thu, 2011-11-10 at 08:45 -0800, Eric Dumazet wrote:
> On Thursday 10 November 2011 at 18:27 +0200, Eilon Greenstein wrote:
> > On Thu, 2011-11-10 at 07:27 -0800, Eric Dumazet wrote:
> > > On Thursday 10 November 2011 at 17:05 +0200, Eilon Greenstein wrote:
> > > > > --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > > > +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
> > > > > @@ -1185,9 +1185,14 @@ struct bnx2x {
> > > > >  #define ETH_MAX_PACKET_SIZE		1500
> > > > >  #define ETH_MAX_JUMBO_PACKET_SIZE	9600
> > > > >  
> > > > > -	/* Max supported alignment is 256 (8 shift) */
> > > > > -#define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
> > > > > -					 L1_CACHE_SHIFT : 8)
> > > > > +/* Max supported alignment is 256 (8 shift)
> > > > > + * It should ideally be min(L1_CACHE_SHIFT, 8)
> > > > > + * Choosing 5 (32 bytes) permits to get skb heads of 2048 bytes
> > > > > + * instead of 4096 bytes.
> > > > > + * With SLUB/SLAB allocators, data will be cache line aligned anyway.
> > > > > + */
> > > > > +#define BNX2X_RX_ALIGN_SHIFT		5
> > > > > +
> > > > 
> > > > Hi Eric,
> > > > 
> > > > This can seriously hurt the PCI utilization. So in scenarios in which
> > > > the PCI is the bottle neck, you will see performance degradation. We are
> > > > looking at alternatives to reduce the allocation, but it is taking a
> > > > while. Please hold off with this patch.
> > > 
> > > What do you mean exactly ?
> > > 
> > > This patch doesnt change skb->data alignment, its still 64 bytes
> > > aligned. (cqe_fp->placement_offset == 2). PCI utilization is the same.
> > > 
> > > Only SLOB could get a misalignement, but who uses SLOB for performance ?
> > 
> > Obviously you are right... But the FW is configured to the wrong
> > alignment and that will affect the end alignment (padding) which is
> > significant in small packets scenarios where the PCI is the bottle neck.
> 
> Yes, I fully understand.
> 
> > 
> > > Alternative would be to check why hardware need 2*L1_CACHE_BYTES extra
> > > room for alignment... Normaly it could be 1*L1_CACHE_BYTES ?
> > 
> > Again - you are a mind reader :) This is what we are looking into right
> > now. The problem is that `if` the buffer is not aligned (SLOB) we can
> > overstep the allocated boundaries by configuring the FW to align.
> > 
> > >  	/* FW use 2 Cache lines Alignment for start packet and size  */
> > > -#define BNX2X_FW_RX_ALIGN              (2 << BNX2X_RX_ALIGN_SHIFT)
> > > +#define BNX2X_FW_RX_ALIGN              (1 << BNX2X_RX_ALIGN_SHIFT)
> > > 
> > > 
> 
> I did a SLOB test (and my patch included as well)
> 
> skb->len=66 pad=26 wkb->data=0xffff8801194da048 truesize=2304
> 
> So skb->data + pad -> 0xffff8801194da062 : So a 32bytes alignement + 2
> bytes to align IP header. (BTW we dont really need it, NET_IP_ALIGN is
> now 0 on most x86 platforms ?)
> 
> In the end, we get 98 bytes of 'skb reserve', and also 64 bytes of extra
> headroom _after_ the end of full frame.
> 
> In my understanding, hardware alignement should be between 0 and 63, not
> 0 and 127.

I'm not sure I'm following the math here. Assuming L1 is 64 bytes,
we need up to 63 bytes to align the start address (assuming SLOB is
being used) and an additional (up to) 63 bytes at the end. That can sum up
to 126 bytes; am I missing something?

> So maybe only BNX2X_FW_RX_ALIGN is twice the needed amount.

I agree that it does not make much sense to optimize for SLOB. After
checking with our FW expert, it seems that we can change the FW to have
two different configuration flags, one for start address alignment and one
for end packet padding. This way, we can set only the end packet padding and
add only 64 bytes. The only downside is that the FW team is preoccupied, so
this new FW will be ready for submission only in about a month.

Thanks,
Eilon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-13 18:53                               ` Eilon Greenstein
@ 2011-11-13 19:42                                 ` Eric Dumazet
  2011-11-13 20:08                                   ` Eilon Greenstein
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-13 19:42 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Sunday 13 November 2011 at 20:53 +0200, Eilon Greenstein wrote:

> I’m not sure I’m following the math over here. Assuming L1 is 64 bytes,
> we need up to 63 bytes to align the start address (assuming SLOB is
> being used) and additional (up to) 63 bytes at the end. That can sum up
> to 126 bytes  am I missing something?
> 

What do you really mean by aligning the end?

How can both the start and the end of a frame be aligned?

If the hardware needs extra room after the end of the frame, then we already
have it (since we store struct skb_shared_info there).



I ran the following patch and everything is fine here:

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index aec7212..ddc94cc 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -1188,8 +1188,8 @@ struct bnx2x {
 	/* Max supported alignment is 256 (8 shift) */
 #define BNX2X_RX_ALIGN_SHIFT		((L1_CACHE_SHIFT < 8) ? \
 					 L1_CACHE_SHIFT : 8)
-	/* FW use 2 Cache lines Alignment for start packet and size  */
-#define BNX2X_FW_RX_ALIGN		(2 << BNX2X_RX_ALIGN_SHIFT)
+	/* FW use Cache line Alignment for start packet and size  */
+#define BNX2X_FW_RX_ALIGN		(1 << BNX2X_RX_ALIGN_SHIFT)
 #define BNX2X_PXP_DRAM_ALIGN		(BNX2X_RX_ALIGN_SHIFT - 5)
 
 	struct host_sp_status_block *def_status_blk;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-13 19:42                                 ` Eric Dumazet
@ 2011-11-13 20:08                                   ` Eilon Greenstein
  2011-11-13 22:00                                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eilon Greenstein @ 2011-11-13 20:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Sun, 2011-11-13 at 11:42 -0800, Eric Dumazet wrote:
> On Sunday 13 November 2011 at 20:53 +0200, Eilon Greenstein wrote:
> 
> > I’m not sure I’m following the math over here. Assuming L1 is 64 bytes,
> > we need up to 63 bytes to align the start address (assuming SLOB is
> > being used) and additional (up to) 63 bytes at the end. That can sum up
> > to 126 bytes  am I missing something?
> > 
> 
> What do you really mean by aligning the end ?

I mean padding it to a full cache line.

> How can both start and end of a frame can be aligned ?

The packet will start at an aligned address and (using padding) will end on
a cache line boundary.

> If hardware needs extra room after the end of frame, then we already
> have it (since we store struct skb_shared_info here)

We have some space in there, but as far as I can tell it's not up to 63
bytes, right? We will overrun the dataref.

Thanks,
Eilon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-13 20:08                                   ` Eilon Greenstein
@ 2011-11-13 22:00                                     ` Eric Dumazet
  2011-11-14  5:08                                       ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-13 22:00 UTC (permalink / raw)
  To: eilong; +Cc: David Miller, bhutchings, pstaszewski, netdev

On Sunday 13 November 2011 at 22:08 +0200, Eilon Greenstein wrote:
> On Sun, 2011-11-13 at 11:42 -0800, Eric Dumazet wrote:
> > On Sunday 13 November 2011 at 20:53 +0200, Eilon Greenstein wrote:
> > 
> > > I’m not sure I’m following the math over here. Assuming L1 is 64 bytes,
> > > we need up to 63 bytes to align the start address (assuming SLOB is
> > > being used) and additional (up to) 63 bytes at the end. That can sum up
> > > to 126 bytes  am I missing something?
> > > 
> > 
> > What do you really mean by aligning the end ?
> 
> I mean padding it to full cache line.
> 
> > How can both start and end of a frame can be aligned ?
> 
> The packet will start at aligned address and (using padding) will end at
> cache line boundaries.
> 

OK, so the hardware adds up to 63 bytes of padding at the end of the packet.

> > If hardware needs extra room after the end of frame, then we already
> > have it (since we store struct skb_shared_info here)
> 
> We have some space in there, but as far as I can tell it's not up to 63
> bytes, right? We will overrun the dataref.
> 

OK, then we need to use build_skb() for this driver :)

http://lists.openwall.net/netdev/2011/07/11/19

This way, we build the skb_shared_info content _after_ the frame is
delivered by the device.
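
For reference, here is a minimal sketch of what a build_skb() style
receive path might look like, following the idea in the patch linked
above: the driver posts a bare buffer to the NIC and wraps it in an skb
only once the frame has landed, so the skb head and skb_shared_info are
written while those cache lines are hot. The single-argument
build_skb(data) form and the surrounding structure are assumptions taken
from that 2011 posting, not the final net-next API; error handling and
ring management are omitted.

/* Minimal sketch of a build_skb() style receive path.  The
 * single-argument build_skb(data) form is assumed from the linked
 * 2011 posting; the helper that finally lands in net-next may
 * differ (e.g. take a frag size argument). */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/dma-mapping.h>
#include <linux/slab.h>

/* Refill: give the NIC a bare kmalloc() buffer, no skb yet. */
static void *rx_refill_one(struct device *dev, unsigned int buf_size,
			   dma_addr_t *mapping)
{
	void *data = kmalloc(buf_size, GFP_ATOMIC);

	if (data)
		*mapping = dma_map_single(dev, data, buf_size,
					  DMA_FROM_DEVICE);
	return data;
}

/* Completion: the frame is already in 'data', so the skb head and
 * skb_shared_info are written while those cache lines are hot. */
static void rx_complete_one(struct napi_struct *napi, struct device *dev,
			    struct net_device *netdev, void *data,
			    dma_addr_t mapping, unsigned int buf_size,
			    unsigned int pad, unsigned int len)
{
	struct sk_buff *skb;

	dma_unmap_single(dev, mapping, buf_size, DMA_FROM_DEVICE);

	skb = build_skb(data);          /* assumed single-arg form */
	if (!skb) {
		kfree(data);
		return;
	}
	skb_reserve(skb, pad);          /* hardware placement offset */
	skb_put(skb, len);
	skb->protocol = eth_type_trans(skb, netdev);
	napi_gro_receive(napi, skb);
}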

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-13 22:00                                     ` Eric Dumazet
@ 2011-11-14  5:08                                       ` David Miller
  2011-11-14  6:25                                         ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2011-11-14  5:08 UTC (permalink / raw)
  To: eric.dumazet; +Cc: eilong, bhutchings, pstaszewski, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 13 Nov 2011 23:00:57 +0100

> OK then we need using build_skb() for this driver :)
> 
> http://lists.openwall.net/netdev/2011/07/11/19
> 
> This way, we build the skb_shared_info content _after_ frame is
> delivered by device.

I fully support bringing this thing back to life :-)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-14  5:08                                       ` David Miller
@ 2011-11-14  6:25                                         ` Eric Dumazet
  2011-11-14 15:57                                           ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-14  6:25 UTC (permalink / raw)
  To: David Miller; +Cc: eilong, bhutchings, pstaszewski, netdev

On Monday 14 November 2011 at 00:08 -0500, David Miller wrote:

> I fully support bringing this thing back to life :-)

I'll run extensive tests today and provide two patches when ready, with
all the performance results.

Some prefetch() calls will be removed, since build_skb() already provides
a cache-hot skb.

Thanks

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-14  6:25                                         ` Eric Dumazet
@ 2011-11-14 15:57                                           ` Eric Dumazet
  2011-11-14 19:21                                             ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2011-11-14 15:57 UTC (permalink / raw)
  To: David Miller
  Cc: eilong, bhutchings, pstaszewski, netdev, Thomas Graf,
	Tom Herbert, Jamal Hadi Salim, Stephen Hemminger

On Monday 14 November 2011 at 07:25 +0100, Eric Dumazet wrote:
> On Monday 14 November 2011 at 00:08 -0500, David Miller wrote:
> 
> > I fully support bringing this thing back to life :-)
> 
> I'll make extensive tests today and provide two patches when ready, with
> all performance results.
> 
> Some prefetch() calls will be removed, since build_skb() provides
> already cache hot skb.

Impressive results:

before: 720.000 pps
after:  820.000 pps

[ One mono-threaded application receiving UDP messages on a single
socket, requesting IP_PKTINFO ancillary info ]

Latencies are also a bit improved: the softirq handler dirties about 320
fewer bytes per skb.

Definitely worth the pain.

I am sending two patches. Other drivers can probably benefit from
build_skb() as well.

[PATCH net-next 1/2] net: introduce build_skb()
[PATCH net-next 2/2] bnx2x: uses build_skb() in receive path

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH net-next] bnx2x: reduce skb truesize by 50%
  2011-11-14 15:57                                           ` Eric Dumazet
@ 2011-11-14 19:21                                             ` David Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2011-11-14 19:21 UTC (permalink / raw)
  To: eric.dumazet
  Cc: eilong, bhutchings, pstaszewski, netdev, tgraf, therbert, hadi,
	shemminger

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 14 Nov 2011 16:57:45 +0100

> On Monday 14 November 2011 at 07:25 +0100, Eric Dumazet wrote:
>> On Monday 14 November 2011 at 00:08 -0500, David Miller wrote:
>> 
>> > I fully support bringing this thing back to life :-)
>> 
>> I'll make extensive tests today and provide two patches when ready, with
>> all performance results.
>> 
>> Some prefetch() calls will be removed, since build_skb() provides
>> already cache hot skb.
> 
> Impressive results :
> 
> before : 720.000 pps
> after :  820.000 pps
> 
> [ One mono threaded application receiving UDP messages on a single
> socket, asking IP_PKTINFO ancillary info ]

Sweeeeeet.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2011-11-14 19:22 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-06 15:57 Linux Route Cache performance tests Paweł Staszewski
2011-11-06 17:29 ` Eric Dumazet
2011-11-06 18:28   ` Paweł Staszewski
2011-11-06 18:48     ` Eric Dumazet
2011-11-06 19:20       ` Paweł Staszewski
2011-11-06 19:38         ` Eric Dumazet
2011-11-06 20:25           ` Paweł Staszewski
2011-11-06 21:26             ` Eric Dumazet
2011-11-06 21:57               ` Paweł Staszewski
2011-11-06 23:08                 ` Eric Dumazet
2011-11-07  8:36                   ` Paweł Staszewski
2011-11-07  9:08                     ` Eric Dumazet
2011-11-07  9:16                       ` Eric Dumazet
2011-11-07 22:12                         ` Paweł Staszewski
2011-11-07 13:42           ` Ben Hutchings
2011-11-07 14:33             ` Eric Dumazet
2011-11-09 17:24               ` [PATCH net-next] ipv4: PKTINFO doesnt need dst reference Eric Dumazet
2011-11-09 21:37                 ` David Miller
2011-11-09 22:03                   ` Eric Dumazet
2011-11-10  0:29                     ` [PATCH net-next] bnx2x: reduce skb truesize by 50% Eric Dumazet
2011-11-10 15:05                       ` Eilon Greenstein
2011-11-10 15:27                         ` Eric Dumazet
2011-11-10 16:27                           ` Eilon Greenstein
2011-11-10 16:45                             ` Eric Dumazet
2011-11-13 18:53                               ` Eilon Greenstein
2011-11-13 19:42                                 ` Eric Dumazet
2011-11-13 20:08                                   ` Eilon Greenstein
2011-11-13 22:00                                     ` Eric Dumazet
2011-11-14  5:08                                       ` David Miller
2011-11-14  6:25                                         ` Eric Dumazet
2011-11-14 15:57                                           ` Eric Dumazet
2011-11-14 19:21                                             ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).