* rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
@ 2009-06-25 15:48 Paweł Staszewski
  2009-06-25 21:19 ` Eric Dumazet
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-25 15:48 UTC
  To: Linux Network Development list

Hello ALL

Some time ago I reported this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it is back.
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.28
        Max depth:      6
        Leaves:         276539
        Prefixes:       289922
        Internal nodes: 66762
          1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5  9: 1  18: 1
        Pointers: 691228
Null ptrs: 347928
Total size: 35709  kB

Counters:
---------
gets = 26276593
backtracks = 547306
semantic match passed = 26188746
semantic match miss = 1117
null node hit= 27285055
skipped node resize = 0

Local:
        Aver depth:     3.33
        Max depth:      4
        Leaves:         9
        Prefixes:       10
        Internal nodes: 8
          1: 8
        Pointers: 16
Null ptrs: 0
Total size: 2  kB

Counters:
---------
gets = 26642350
backtracks = 1282818
semantic match passed = 18166
semantic match miss = 0
null node hit= 0
skipped node resize = 0
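
As a cross-check on the Main table above: each internal node (tnode) of b bits holds 2^b child pointers, so the "Internal nodes" and "Pointers" totals should follow from the per-size histogram line. The sketch below is only an illustration; the struct and function names are made up for this note and are not kernel identifiers. The histogram values are copied from the output above.

```c
#include <stddef.h>

/* One histogram bucket from fib_triestat: `count` tnodes of `bits` bits.
 * Illustrative type, not a kernel structure. */
struct bucket {
	unsigned bits;
	unsigned long count;
};

/* Values copied from the "Main" histogram line above. */
static const struct bucket main_hist[] = {
	{1, 35046}, {2, 13824}, {3, 9508}, {4, 4897}, {5, 2331},
	{6, 1149}, {7, 5}, {9, 1}, {18, 1},
};
static const size_t main_hist_len = sizeof(main_hist) / sizeof(main_hist[0]);

/* Number of internal nodes: the sum of all bucket counts. */
static unsigned long total_nodes(const struct bucket *h, size_t n)
{
	unsigned long sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += h[i].count;
	return sum;
}

/* Number of child pointers: each tnode of `bits` bits has 2^bits slots. */
static unsigned long total_pointers(const struct bucket *h, size_t n)
{
	unsigned long sum = 0;
	for (size_t i = 0; i < n; i++)
		sum += h[i].count << h[i].bits;
	return sum;
}
```

With these inputs the two sums come to 66762 nodes and 691228 pointers, matching the reported figures. The single 18-bit root alone accounts for 262144 of those pointers, and the 347928 null pointers are roughly half of the total.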



This machine is running bgpd with two bgp peers / full route table

 cat /proc/meminfo
MemTotal:       12279032 kB
MemFree:        11521920 kB
Buffers:           80288 kB
Cached:            34416 kB
SwapCached:            0 kB
Active:           286816 kB
Inactive:          82024 kB
Active(anon):     254296 kB
Inactive(anon):        0 kB
Active(file):      32520 kB
Inactive(file):    82024 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:        987988 kB
SwapFree:         987988 kB
Dirty:              1140 kB
Writeback:             0 kB
AnonPages:        254164 kB
Mapped:             5440 kB
Slab:             365084 kB
SReclaimable:      28784 kB
SUnreclaim:       336300 kB
PageTables:         2104 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7127504 kB
Committed_AS:     267704 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       11824 kB
VmallocChunk:   34359707815 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        3392 kB
DirectMap2M:    12578816 kB


Interface MTU is 1500



* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-25 15:48 rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
@ 2009-06-25 21:19 ` Eric Dumazet
  2009-06-25 21:52   ` Paweł Staszewski
  2009-06-26  8:03   ` Jarek Poplawski
  0 siblings, 2 replies; 99+ messages in thread
From: Eric Dumazet @ 2009-06-25 21:19 UTC
  To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Hello ALL
> 
> Some time ago I reported this:
> http://bugzilla.kernel.org/show_bug.cgi?id=6648
> 
> and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it is back.
> dmesg output:
> oprofile: using NMI interrupt.
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits

Curious, you seem to hit an old alloc_pages() limit... (MAX_ORDER allocation)

Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes once rounded up to the next power of two)

But since the following commit we should be using vmalloc(), so this (PAGE_SIZE << 10) limit
should no longer apply.
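
The arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are made up, the 56-byte header comes from the "size of tnode" line in the fib_triestat output, and a 4096-byte page is assumed.

```c
#include <stddef.h>

/* Bytes needed for a tnode with 2^bits child pointers plus a fixed header.
 * Hypothetical helper, not a kernel function. */
static size_t tnode_bytes(unsigned bits, size_t ptr_size, size_t header)
{
	return header + ((size_t)1 << bits) * ptr_size;
}

/* Smallest power-of-two multiple of page_size that holds `bytes`.
 * This models how a power-of-two page allocator rounds a request up. */
static size_t pow2_round(size_t bytes, size_t page_size)
{
	size_t sz = page_size;
	while (sz < bytes)
		sz <<= 1;
	return sz;
}
```

For an 18-bit root with 8-byte pointers, tnode_bytes(18, 8, 0) gives the 2097152-byte pointer array, and pow2_round(tnode_bytes(18, 8, 56), 4096) gives 4194304 bytes: the small header alone pushes the request into the next power of two, which is the waste the commit below describes.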

Could you do a "cat /proc/vmallocinfo" just to check that your big tnodes are vmalloc()ed?


commit 15be75cdb5db442d0e33d37b20832b88f3ccd383
Author: Stephen Hemminger <shemminger@vyatta.com>
Date:   Thu Apr 10 02:56:38 2008 -0700

    IPV4: fib_trie use vmalloc for large tnodes

    Use vmalloc rather than alloc_pages to avoid wasting memory.
    The problem is that tnode structure has a power of 2 sized array,
    plus a header. So the current code wastes almost half the memory
    allocated because it always needs the next bigger size to hold
    that small header.

    This is similar to an earlier patch by Eric, but instead of a list
    and lock, I used a workqueue to handle the fact that vfree can't
    be done in interrupt context.

    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>





* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-25 21:19 ` Eric Dumazet
@ 2009-06-25 21:52   ` Paweł Staszewski
  2009-06-25 22:54     ` Eric Dumazet
  2009-06-26  8:03   ` Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-25 21:52 UTC
  To: Eric Dumazet; +Cc: Linux Network Development list


cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfe6a000 ioremap
0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46 phys=dfef5000 ioremap
0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef2000 ioremap
0xf800c000-0xf800e000    8192 acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfefb000 ioremap
0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef4000 ioremap
0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef3000 ioremap
0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef1000 ioremap
0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfef0000 ioremap
0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeef000 ioremap
0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeee000 ioremap
0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeed000 ioremap
0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46 phys=dfeec000 ioremap
0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a phys=fed1c000 ioremap
0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65 pages=1 vmalloc
0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e pages=1 vmalloc
0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7 pages=3 vmalloc
0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96 pages=3 vmalloc
0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc





* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-25 21:52   ` Paweł Staszewski
@ 2009-06-25 22:54     ` Eric Dumazet
  2009-06-26 10:06       ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-25 22:54 UTC
  To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> 
> cat /proc/vmallocinfo
> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfe6a000 ioremap
> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
> phys=dfef5000 ioremap
> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef2000 ioremap
> 0xf800c000-0xf800e000    8192
> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfefb000 ioremap
> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef4000 ioremap
> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef3000 ioremap
> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef1000 ioremap
> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef0000 ioremap
> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeef000 ioremap
> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeee000 ioremap
> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeed000 ioremap
> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeec000 ioremap
> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
> phys=fed1c000 ioremap
> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
> pages=1 vmalloc
> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
> pages=3 vmalloc
> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
> pages=3 vmalloc
> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
> pages=3 vmalloc
> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
> pages=3 vmalloc
> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
> 0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc

This is from a 32-bit kernel.

It doesn't match your previous /proc/meminfo (from a 64-bit kernel on a 12 GB machine).

Of course, I would like /proc/vmallocinfo from your loaded router, not from
a dev machine :)
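
The mismatch can be checked mechanically from the two outputs. The sketch below is an illustration only, with made-up helper names and an assumed 4096-byte page. It also assumes that each vmalloc range listed in /proc/vmallocinfo includes one trailing guard page, which is what the listing suggests (for example, the 73728-byte journal_init range reports pages=17).

```c
#include <stdint.h>

/* True if the value fits in a 32-bit address space.
 * Hypothetical helper for this note. */
static int fits_in_32_bits(uint64_t value)
{
	return value <= UINT32_MAX;
}

/* Pages backing a vmalloc range, assuming the listed range carries
 * one extra guard page at the end. */
static uint64_t vmalloc_pages(uint64_t start, uint64_t end, uint64_t page_size)
{
	return (end - start) / page_size - 1;
}
```

The tnode_new range 0xf846a000-0xf856c000 sits entirely below 4 GiB and yields 257 pages, matching its pages=257 field, whereas the earlier VmallocTotal of 34359738367 kB cannot fit in a 32-bit address space. One reading of the 257 pages is a 2^18-entry array of 4-byte pointers (256 pages) plus one page for the header, which would again point to a 32-bit kernel.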




* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-25 21:19 ` Eric Dumazet
  2009-06-25 21:52   ` Paweł Staszewski
@ 2009-06-26  8:03   ` Jarek Poplawski
  2009-06-26  9:19     ` Robert Olsson
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26  8:03 UTC
  To: Eric Dumazet
  Cc: Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On 25-06-2009 23:19, Eric Dumazet wrote:
> Paweł Staszewski wrote:
>> Hello ALL
>>
>> Some time ago I reported this:
>> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>
>> and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it is back.
>> dmesg output:
>> oprofile: using NMI interrupt.
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
> 
> Curious, you seem to hit an old alloc_pages() limit... (MAX_ORDER allocation)
> 
> Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes (+ header -> 4194304 bytes once rounded up to the next power of two)
> 
> But since the following commit we should be using vmalloc(), so this (PAGE_SIZE << 10) limit
> should no longer apply.
> 

On the other hand, even if there is no problem with memory, it seems
that because max_resize is being hit the threshold should be changed,
e.g. by reverting the patch below.

Jarek P.


commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Mon Mar 19 16:29:58 2007 -0700

    [IPV4]: fib_trie root node settings
    
    The threshold for the root node can be set more aggressively to get
    better tree compression. The new setting makes the root grow
    from 16 to 19 bits, with a substantial improvement in Aver depth,
    given the current table of 214393 prefixes.
    
    But really the dynamic resize should need more investigation
    both in terms convergence and performance and maybe it should
    be possible to change...
    
    Maybe just for the brave to start with or we may have to back
    this out.

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5d2b43d..9be7da7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -292,8 +292,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static int halve_threshold = 25;
 static int inflate_threshold = 50;
-static int halve_threshold_root = 15;
-static int inflate_threshold_root = 25;
+static int halve_threshold_root = 8;
+static int inflate_threshold_root = 15;
 
 
 static void __alias_free_mem(struct rcu_head *head)
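
To make the effect of these constants concrete, here is a simplified userspace model of the inflate test, written from my reading of the resize logic in net/ipv4/fib_trie.c of that era. It is an assumption-laden sketch, not the kernel code: the function name and parameters are invented for this note, and the real resize loop also bounds the number of attempts (the max_resize mentioned above) and handles allocation failures.

```c
/* Hypothetical model of the inflate decision for one tnode.
 * full_children:  children that are themselves fully populated tnodes
 * empty_children: NULL child slots
 * child_length:   number of child slots, i.e. 2^bits
 * threshold:      inflate_threshold or inflate_threshold_root
 * Returns 1 if the node would be doubled in size, 0 otherwise. */
static int wants_inflate(unsigned long full_children,
			 unsigned long empty_children,
			 unsigned long child_length,
			 int threshold)
{
	if (full_children == 0)
		return 0;
	return 50 * (full_children + child_length - empty_children)
		>= (unsigned long)threshold * child_length;
}
```

For an 11-bit node (2048 slots) with, say, 100 full children and 1400 empty slots (made-up numbers), the model asks for another inflate with the current root threshold of 15 but not with the previous value of 25. A lower threshold therefore keeps the root growing for longer, which is the behaviour the revert would tone down.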


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26  8:03   ` Jarek Poplawski
@ 2009-06-26  9:19     ` Robert Olsson
  2009-06-26  9:37       ` Jarek Poplawski
  2009-06-27 19:20       ` Jarek Poplawski
  0 siblings, 2 replies; 99+ messages in thread
From: Robert Olsson @ 2009-06-26  9:19 UTC
  To: Jarek Poplawski
  Cc: Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list


Jarek Poplawski writes:

 > >> oprofile: using NMI interrupt.
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits
 > >> Fix inflate_threshold_root. Now=15 size=11 bits

 > On the other hand, even if there is no problem with memory, it seems
 > because of hitting max_resize the threshold should be changed, e.g.
 > by reverting the patch below.

 You seem to have some temporary memory problem, so the printout might be
 a bit misleading in this case. We really like to keep the root node as big
 as we can to keep the tree as flat as possible for performance reasons.
 (We're even more motivated now that we can disable the route cache.)

 So I'd guess the next insert/delete inflates the root node to be within
 the interval. So I'd assume this is just a temporary failure?

 It would be nice to have the *thresholds* settable via /proc or /sys. I would
 use this in the other direction, to trade memory for even faster lookups.
 
 But maybe the memory allocation experts have some good suggestions.

 Cheers.
					--ro
 


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26  9:19     ` Robert Olsson
@ 2009-06-26  9:37       ` Jarek Poplawski
  2009-06-26 10:26         ` Jorge Boncompte [DTI2]
  2009-06-26 12:42         ` Robert Olsson
  2009-06-27 19:20       ` Jarek Poplawski
  1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26  9:37 UTC
  To: Robert Olsson
  Cc: Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 11:19:07AM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  > >> oprofile: using NMI interrupt.
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
> 
>  > On the other hand, even if there is no problem with memory, it seems
>  > because of hitting max_resize the threshold should be changed, e.g.
>  > by reverting the patch below.
> 
>  You seem to have some temporary memory problem, so the printout might be
>  a bit misleading in this case. We really like to keep the root node as big
>  as we can to keep the tree as flat as possible for performance reasons.
>  (We're even more motivated now that we can disable the route cache.)
> 
>  So I'd guess the next insert/delete inflates the root node to be within
>  the interval. So I'd assume this is just a temporary failure?
> 
>  It would be nice to have the *thresholds* settable via /proc or /sys. I would
>  use this in the other direction, to trade memory for even faster lookups.
>  
>  But maybe the memory allocation experts have some good suggestions.
> 

Pawel has reported these problems for a long time:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

So, until it's fully investigated, it seems some 'fast' fix is needed
here.

Cheers,
Jarek P.


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-25 22:54     ` Eric Dumazet
@ 2009-06-26 10:06       ` Paweł Staszewski
  2009-06-26 10:34         ` Eric Dumazet
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 10:06 UTC (permalink / raw)
  Cc: Linux Network Development list

Eric Dumazet wrote:
> Paweł Staszewski wrote:
>   
>> cat /proc/vmallocinfo
>> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfe6a000 ioremap
>> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef5000 ioremap
>> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef2000 ioremap
>> 0xf800c000-0xf800e000    8192
>> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
>> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfefb000 ioremap
>> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef4000 ioremap
>> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef3000 ioremap
>> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef1000 ioremap
>> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef0000 ioremap
>> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeef000 ioremap
>> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeee000 ioremap
>> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeed000 ioremap
>> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeec000 ioremap
>> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
>> phys=fed1c000 ioremap
>> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
>> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
>> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
>> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
>> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
>> pages=1 vmalloc
>> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
>> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
>> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
>> pages=3 vmalloc
>> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
>> pages=3 vmalloc
>> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
>> pages=3 vmalloc
>> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
>> pages=3 vmalloc
>> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
>> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
>> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
>> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
>> 0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
>>     
>
> This is from a 32 bit kernel.
>
> This doesn't match your previous /proc/meminfo (from a 64bit kernel on a 12 GB machine)
>
> Of course, I would like /proc/vmallocinfo on your loaded router, not from
> a dev machine :)
>
>   

Yes, sorry for not mentioning that.
I am testing the same kernel configuration on one 32-bit machine and one 64-bit machine.

Here is meminfo from the 32-bit machine, running kernel 2.6.30:
cat /proc/meminfo
MemTotal:        3625444 kB
MemFree:         3043648 kB
Buffers:          133968 kB
Cached:            36316 kB
SwapCached:            0 kB
Active:           256868 kB
Inactive:          76252 kB
Active(anon):     163064 kB
Inactive(anon):        0 kB
Active(file):      93804 kB
Inactive(file):    76252 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       2758160 kB
HighFree:        2556136 kB
LowTotal:         867284 kB
LowFree:          487512 kB
SwapTotal:        995896 kB
SwapFree:         995896 kB
Dirty:              3624 kB
Writeback:             0 kB
AnonPages:        162912 kB
Mapped:             3612 kB
Slab:             235888 kB
SReclaimable:      46408 kB
SUnreclaim:       189480 kB
PageTables:          384 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2808616 kB
Committed_AS:     170648 kB
VmallocTotal:     122880 kB
VmallocUsed:        2876 kB
VmallocChunk:     109824 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB
DirectMap4k:        8184 kB
DirectMap4M:      901120 kB
And vmallocinfo from the same machine:

cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfe6a000 ioremap
0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46 
phys=dfef5000 ioremap
0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef2000 ioremap
0xf800c000-0xf800e000    8192 
acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfefb000 ioremap
0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef4000 ioremap
0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef3000 ioremap
0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef1000 ioremap
0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfef0000 ioremap
0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeef000 ioremap
0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeee000 ioremap
0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeed000 ioremap
0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=dfeec000 ioremap
0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a 
phys=fed1c000 ioremap
0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65 
pages=1 vmalloc
0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e 
pages=1 vmalloc
0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7 
pages=3 vmalloc
0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96 
pages=3 vmalloc
0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7 
pages=3 vmalloc
0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96 
pages=3 vmalloc
0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
0xf8bbc000-0xf8cbe000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc


And the next machine, running kernel 2.6.29.3:
dmesg:
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
Fix inflate_threshold_root. Now=15 size=11 bits
cat /proc/meminfo
MemTotal:        2072652 kB
MemFree:          496960 kB
Buffers:          267620 kB
Cached:           895212 kB
SwapCached:            0 kB
Active:           675744 kB
Inactive:         703312 kB
Active(anon):     215848 kB
Inactive(anon):        0 kB
Active(file):     459896 kB
Inactive(file):   703312 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:       1186696 kB
HighFree:         151156 kB
LowTotal:         885956 kB
LowFree:          345804 kB
SwapTotal:       1975984 kB
SwapFree:        1975984 kB
Dirty:                20 kB
Writeback:             0 kB
AnonPages:        215724 kB
Mapped:             6120 kB
Slab:             186652 kB
SReclaimable:     125832 kB
SUnreclaim:        60820 kB
PageTables:          416 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3012308 kB
Committed_AS:     223692 kB
VmallocTotal:     122880 kB
VmallocUsed:        3192 kB
VmallocChunk:     108436 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       4096 kB
DirectMap4k:        8184 kB
DirectMap4M:      901120 kB
cat /proc/vmallocinfo
0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=7fee0000 ioremap
0xf8000000-0xf8005000   20480 acpi_tb_verify_table+0x1d/0x46 
phys=7fee3000 ioremap
0xf8006000-0xf8008000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=7fee3000 ioremap
0xf800a000-0xf800c000    8192 acpi_tb_verify_table+0x1d/0x46 
phys=7fee6000 ioremap
0xf800d000-0xf800f000    8192 reiserfs_init_bitmap_cache+0x3b/0x80 
pages=1 vmalloc
0xf8010000-0xf8022000   73728 journal_init+0x30/0x8f0 pages=17 vmalloc
0xf8023000-0xf8025000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 
pages=1 vmalloc
0xf8026000-0xf8028000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 
pages=1 vmalloc
0xf8029000-0xf802b000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 
pages=1 vmalloc
0xf802c000-0xf802e000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 
pages=1 vmalloc
0xf802f000-0xf8031000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90 
pages=1 vmalloc
0xf803e000-0xf8040000    8192 e1000_setup_all_tx_resources+0x57/0x660 
pages=1 vmalloc
0xf8040000-0xf8061000  135168 e1000_probe+0x207/0xeb0 phys=f5000000 ioremap
0xf8062000-0xf8064000    8192 e1000_setup_all_rx_resources+0x57/0x6d0 
pages=1 vmalloc
0xf8065000-0xf8067000    8192 e1000_setup_all_tx_resources+0x57/0x660 
pages=1 vmalloc
0xf8068000-0xf806a000    8192 e1000_setup_all_rx_resources+0x57/0x6d0 
pages=1 vmalloc
0xf806b000-0xf806d000    8192 e1000_setup_all_tx_resources+0x57/0x660 
pages=1 vmalloc
0xf806e000-0xf8070000    8192 e1000_setup_all_rx_resources+0x57/0x6d0 
pages=1 vmalloc
0xf8080000-0xf80a1000  135168 e1000_probe+0x207/0xeb0 phys=f1040000 ioremap
0xf80c0000-0xf80e1000  135168 e1000_probe+0x207/0xeb0 phys=f4000000 ioremap
0xf80e2000-0xf8173000  593920 journal_init+0x56e/0x8f0 pages=144 vmalloc
0xf8174000-0xf8267000  995328 sys_swapon+0x548/0xa30 pages=242 vmalloc
0xf8d17000-0xf8e19000 1056768 tnode_new+0x7f/0x90 pages=257 vmalloc
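
The tnode_new entries in these listings (1056768 bytes, pages=257) are consistent with an 18-bit root node using 4-byte pointers on a 32-bit kernel. A minimal sketch of that arithmetic follows. It assumes a 4096-byte page and a placeholder header size (the real sizeof(struct tnode) on these machines is not shown in the thread), and it assumes the size column counts one extra page beyond pages=N, which is what every vmalloc entry in the listings shows. The helper names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed x86 page size


def tnode_pages(bits, ptr_size, header=32):
    """Pages needed for a tnode with 2**bits child pointers.

    header stands in for sizeof(struct tnode); it is a placeholder.
    Any value from 1 to PAGE_SIZE bytes gives the same page count here.
    """
    size = header + (1 << bits) * ptr_size
    return -(-size // PAGE_SIZE)  # ceiling division


def vmallocinfo_size(pages):
    """Size column as seen in the listings: data pages plus one extra page."""
    return (pages + 1) * PAGE_SIZE


def buddy_alloc_bytes(pages):
    """Bytes a power-of-two page allocation would take for this many pages."""
    order = (pages - 1).bit_length()
    return PAGE_SIZE << order


# 32-bit kernel: 2**18 pointers of 4 bytes, plus a small header
pages32 = tnode_pages(18, 4)
print(pages32, vmallocinfo_size(pages32))      # 257 1056768

# 64-bit kernel: 2**18 pointers of 8 bytes, plus the 56-byte tnode
pages64 = tnode_pages(18, 8, header=56)
print(pages64, buddy_alloc_bytes(pages64))     # 513 4194304
```

The 64-bit case shows why a power-of-two page allocation needs 4194304 bytes (PAGE_SIZE<<10) for a node whose pointer array alone is 2097152 bytes: the header pushes it one page past 512 pages.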


I have this information from 5 machines that work in an iBGP mesh,
and from only one 64-bit dev machine, which is one of the failover members;
but I killed that machine after upgrading it to kernel 2.6.31-rc1.
 

>> Eric Dumazet wrote:
>>     
>>> Paweł Staszewski wrote:
>>>  
>>>       
>>>> Hello ALL
>>>>
>>>> Some time ago i report this:
>>>> http://bugzilla.kernel.org/show_bug.cgi?id=6648
>>>>
>>>> and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
>>>> dmesg output:
>>>> oprofile: using NMI interrupt.
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>> Fix inflate_threshold_root. Now=15 size=11 bits
>>>>     
>>>>         
>>> Curious, you seem to hit an old alloc_pages() limit... (MAX_ORDER
>>> allocation)
>>>
>>> Your root node has 2^18 = 262144 pointers of 8 bytes -> 2097152 bytes
>>> (+ header -> 4194304 bytes)
>>>
>>> But since the following commit, we should use vmalloc(), so this
>>> (PAGE_SIZE<<10) limit should no longer apply.
>>>
>>> Could you do a "cat /proc/vmallocinfo" just to check your big tnodes
>>> are vmalloced() ?
>>>
>>>
>>> commit 15be75cdb5db442d0e33d37b20832b88f3ccd383
>>> Author: Stephen Hemminger <shemminger@vyatta.com>
>>> Date:   Thu Apr 10 02:56:38 2008 -0700
>>>
>>>     IPV4: fib_trie use vmalloc for large tnodes
>>>
>>>     Use vmalloc rather than alloc_pages to avoid wasting memory.
>>>     The problem is that tnode structure has a power of 2 sized array,
>>>     plus a header. So the current code wastes almost half the memory
>>>     allocated because it always needs the next bigger size to hold
>>>     that small header.
>>>
>>>     This is similar to an earlier patch by Eric, but instead of a list
>>>     and lock, I used a workqueue to handle the fact that vfree can't
>>>     be done in interrupt context.
>>>
>>>     Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>>>     Signed-off-by: David S. Miller <davem@davemloft.net>
>>>
>>>
>>>  
>>>       
>>>> cat /proc/net/fib_triestat
>>>> Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
>>>> Main:
>>>>        Aver depth:     2.28
>>>>        Max depth:      6
>>>>        Leaves:         276539
>>>>        Prefixes:       289922
>>>>        Internal nodes: 66762
>>>>          1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5
>>>> 9: 1  18: 1
>>>>        Pointers: 691228
>>>> Null ptrs: 347928
>>>> Total size: 35709  kB
>>>>
>>>> Counters:
>>>> ---------
>>>> gets = 26276593
>>>> backtracks = 547306
>>>> semantic match passed = 26188746
>>>> semantic match miss = 1117
>>>> null node hit= 27285055
>>>> skipped node resize = 0
>>>>
>>>> Local:
>>>>        Aver depth:     3.33
>>>>        Max depth:      4
>>>>        Leaves:         9
>>>>        Prefixes:       10
>>>>        Internal nodes: 8
>>>>          1: 8
>>>>        Pointers: 16
>>>> Null ptrs: 0
>>>> Total size: 2  kB
>>>>
>>>> Counters:
>>>> ---------
>>>> gets = 26642350
>>>> backtracks = 1282818
>>>> semantic match passed = 18166
>>>> semantic match miss = 0
>>>> null node hit= 0
>>>> skipped node resize = 0
>>>>
>>>>
>>>>
>>>> This machine is running bgpd with two bgp peers / full route table
>>>>
>>>> cat /proc/meminfo
>>>> MemTotal:       12279032 kB
>>>> MemFree:        11521920 kB
>>>> Buffers:           80288 kB
>>>> Cached:            34416 kB
>>>> SwapCached:            0 kB
>>>> Active:           286816 kB
>>>> Inactive:          82024 kB
>>>> Active(anon):     254296 kB
>>>> Inactive(anon):        0 kB
>>>> Active(file):      32520 kB
>>>> Inactive(file):    82024 kB
>>>> Unevictable:           0 kB
>>>> Mlocked:               0 kB
>>>> SwapTotal:        987988 kB
>>>> SwapFree:         987988 kB
>>>> Dirty:              1140 kB
>>>> Writeback:             0 kB
>>>> AnonPages:        254164 kB
>>>> Mapped:             5440 kB
>>>> Slab:             365084 kB
>>>> SReclaimable:      28784 kB
>>>> SUnreclaim:       336300 kB
>>>> PageTables:         2104 kB
>>>> NFS_Unstable:          0 kB
>>>> Bounce:                0 kB
>>>> WritebackTmp:          0 kB
>>>> CommitLimit:     7127504 kB
>>>> Committed_AS:     267704 kB
>>>> VmallocTotal:   34359738367 kB
>>>> VmallocUsed:       11824 kB
>>>> VmallocChunk:   34359707815 kB
>>>> HugePages_Total:       0
>>>> HugePages_Free:        0
>>>> HugePages_Rsvd:        0
>>>> HugePages_Surp:        0
>>>> Hugepagesize:       2048 kB
>>>> DirectMap4k:        3392 kB
>>>> DirectMap2M:    12578816 kB
>>>>
>>>>
>>>> Interface MTU is 1500
>>>>     
>>>>         
>>>
>>>   
>>>       


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26  9:37       ` Jarek Poplawski
@ 2009-06-26 10:26         ` Jorge Boncompte [DTI2]
  2009-06-26 12:42         ` Robert Olsson
  1 sibling, 0 replies; 99+ messages in thread
From: Jorge Boncompte [DTI2] @ 2009-06-26 10:26 UTC (permalink / raw)
  To: jarkao2
  Cc: Robert Olsson, Eric Dumazet, pstaszewski, Robert Olsson,
	Linux Network Development list

Jarek Poplawski wrote:
> Pawel has reported these problems for a long time:
> http://bugzilla.kernel.org/show_bug.cgi?id=6648
> 
> So, until it's fully investigated, it seems some 'fast' fix is needed
> here.

	I have never reported these problems but I am definitely seeing the same
message on kernel 2.6.29.5, usually, when one of my BGP peers goes down. So,
just a "me too".

	Regards,

		Jorge

-----------------
[ 1198.333854] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.437028] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1198.460848] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.240223] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.279723] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1199.383081] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.154893] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.191711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.223242] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.270299] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1200.355795] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.239254] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.271995] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.349351] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.384676] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.428801] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.457315] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.485710] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1206.513691] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.039681] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.069224] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.108840] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.141450] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.172317] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.197824] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.224711] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.251566] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1209.289603] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.561178] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.598062] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.633238] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1211.684420] Fix inflate_threshold_root. Now=15 size=11 bits
[ 1216.507853] Fix inflate_threshold_root. Now=15 size=11 bits
-----------------
cat /proc/meminfo
MemTotal:         515732 kB
MemFree:          139544 kB
Buffers:            4992 kB
Cached:             8488 kB
SwapCached:            0 kB
Active:           295904 kB
Inactive:           8132 kB
Active(anon):     291716 kB
Inactive(anon):        0 kB
Active(file):       4188 kB
Inactive(file):     8132 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        290556 kB
Mapped:             2320 kB
Slab:              42392 kB
SReclaimable:       1096 kB
SUnreclaim:        41296 kB
PageTables:          512 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      257864 kB
Committed_AS:     294496 kB
VmallocTotal:     515448 kB
VmallocUsed:        3140 kB
VmallocChunk:     501096 kB
DirectMap4k:        8128 kB
DirectMap4M:      516096 kB
-----------------
cat /proc/vmallocinfo
0xe07f0000-0xe07f5000   20480 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe07f6000-0xe07f8000    8192 acpi_tb_verify_table+0x20/0x4a phys=1ffff000 ioremap
0xe07fa000-0xe07fc000    8192 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe07fe000-0xe0800000    8192 acpi_tb_verify_table+0x20/0x4a phys=1fff0000 ioremap
0xe0801000-0xe080d000   49152 cramfs_uncompress_init+0x18/0x57 pages=11 vmalloc
0xe080e000-0xe0810000    8192 e100_probe+0x1db/0x471 phys=fdde0000 ioremap
0xe0812000-0xe0814000    8192 e100_probe+0x1db/0x471 phys=fdd80000 ioremap
0xe0816000-0xe0818000    8192 e100_probe+0x1db/0x471 phys=fbbf0000 ioremap
0xe081a000-0xe081c000    8192 e100_probe+0x1db/0x471 phys=fbbe0000 ioremap
0xe081e000-0xe0820000    8192 ahc_linux_pci_reserve_mem_region+0x49/0x72
[aic7xxx] phys=fe9f0000 ioremap
0xe0820000-0xe0822000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe0822000-0xe0825000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe0826000-0xe0828000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe082d000-0xe0832000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe0833000-0xe0836000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe0838000-0xe083c000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe083c000-0xe0840000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe0840000-0xe0861000  135168 e1000_probe+0x18a/0x83b phys=febc0000 ioremap
0xe0862000-0xe0871000   61440 module_alloc_update_bounds+0x8/0x2c pages=14 vmalloc
0xe0875000-0xe087f000   40960 module_alloc_update_bounds+0x8/0x2c pages=9 vmalloc
0xe0880000-0xe0889000   36864 module_alloc_update_bounds+0x8/0x2c pages=8 vmalloc
0xe088d000-0xe0897000   40960 module_alloc_update_bounds+0x8/0x2c pages=9 vmalloc
0xe0897000-0xe08b5000  122880 module_alloc_update_bounds+0x8/0x2c pages=29 vmalloc
0xe08bd000-0xe08c1000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe08c7000-0xe08c9000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe08c9000-0xe08cc000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe08d3000-0xe08d7000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe08d8000-0xe08dd000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe08de000-0xe08e5000   28672 module_alloc_update_bounds+0x8/0x2c pages=6 vmalloc
0xe08e6000-0xe08e8000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe08e8000-0xe08ed000   20480 module_alloc_update_bounds+0x8/0x2c pages=4 vmalloc
0xe08f3000-0xe08f6000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe08ff000-0xe0902000   12288 module_alloc_update_bounds+0x8/0x2c pages=2 vmalloc
0xe090e000-0xe0914000   24576 module_alloc_update_bounds+0x8/0x2c pages=5 vmalloc
0xe091e000-0xe0922000   16384 module_alloc_update_bounds+0x8/0x2c pages=3 vmalloc
0xe092e000-0xe0934000   24576 module_alloc_update_bounds+0x8/0x2c pages=5 vmalloc
0xe0935000-0xe0942000   53248 module_alloc_update_bounds+0x8/0x2c pages=12 vmalloc
0xe095d000-0xe0979000  114688 module_alloc_update_bounds+0x8/0x2c pages=27 vmalloc
0xe097e000-0xe0980000    8192 ahc_linux_pci_reserve_mem_region+0x49/0x72
[aic7xxx] phys=fe9e0000 ioremap
0xe0990000-0xe0992000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe099a000-0xe099c000    8192 module_alloc_update_bounds+0x8/0x2c pages=1 vmalloc
0xe0a00000-0xe0b01000 1052672 he_start+0x204/0x1126 [he] phys=fe800000 ioremap
0xe14bf000-0xe15c1000 1056768 tnode_new+0x18/0x48 pages=257 vmalloc
0xe15ed000-0xe15f0000   12288 xt_alloc_table_info+0x68/0x97 [x_tables] pages=2
vmalloc
0xe15f1000-0xe15f4000   12288 xt_alloc_table_info+0x68/0x97 [x_tables] pages=2
vmalloc
-----------------
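
To check whether the large tnodes are vmalloc()ed, as asked earlier in the thread, the tnode_new entries can be picked out of /proc/vmallocinfo. The following is a rough sketch only. It assumes each entry sits on a single line, as in the listing above (entries wrapped across two lines by a mail client are skipped), and the function names are hypothetical.

```python
import re

# start-end, size in bytes, caller symbol, then the remaining flags
LINE_RE = re.compile(
    r"^0x(?P<start>[0-9a-f]+)-0x(?P<end>[0-9a-f]+)\s+"
    r"(?P<size>\d+)\s+(?P<caller>\S+)(?P<rest>.*)$"
)


def parse_vmallocinfo(text):
    """Parse vmallocinfo text into a list of dicts, skipping unmatched lines."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        rest = m.group("rest")
        pages = re.search(r"pages=(\d+)", rest)
        entries.append({
            "size": int(m.group("size")),
            "caller": m.group("caller"),
            "pages": int(pages.group(1)) if pages else None,
            "vmalloc": "vmalloc" in rest.split(),
        })
    return entries


def tnode_areas(entries):
    """Keep only vmalloc areas whose caller is tnode_new."""
    return [e for e in entries
            if e["caller"].startswith("tnode_new") and e["vmalloc"]]


# Two lines copied from the listing above
sample = """\
0xe0a00000-0xe0b01000 1052672 he_start+0x204/0x1126 [he] phys=fe800000 ioremap
0xe14bf000-0xe15c1000 1056768 tnode_new+0x18/0x48 pages=257 vmalloc
"""

for area in tnode_areas(parse_vmallocinfo(sample)):
    print(area["caller"], area["size"], area["pages"])
```

On a live router the same functions could be fed the contents of /proc/vmallocinfo (reading it typically requires root) instead of the sample string.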



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 10:06       ` Paweł Staszewski
@ 2009-06-26 10:34         ` Eric Dumazet
  2009-06-26 10:47           ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-26 10:34 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Eric Dumazet wrote:
>> Paweł Staszewski wrote:
>>  
>>> cat /proc/vmallocinfo
>>> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfe6a000 ioremap
>>> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef5000 ioremap
>>> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef2000 ioremap
>>> 0xf800c000-0xf800e000    8192
>>> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
>>> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfefb000 ioremap
>>> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef4000 ioremap
>>> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef3000 ioremap
>>> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef1000 ioremap
>>> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfef0000 ioremap
>>> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfeef000 ioremap
>>> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfeee000 ioremap
>>> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfeed000 ioremap
>>> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
>>> phys=dfeec000 ioremap
>>> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
>>> phys=fed1c000 ioremap
>>> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000
>>> ioremap
>>> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000
>>> ioremap
>>> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
>>> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000
>>> ioremap
>>> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
>>> pages=1 vmalloc
>>> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
>>> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>> pages=1 vmalloc
>>> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>> pages=1 vmalloc
>>> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>> pages=1 vmalloc
>>> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>> pages=1 vmalloc
>>> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>> pages=1 vmalloc
>>> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000
>>> ioremap
>>> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
>>> pages=3 vmalloc
>>> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
>>> pages=3 vmalloc
>>> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
>>> pages=3 vmalloc
>>> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
>>> pages=3 vmalloc
>>> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000
>>> ioremap
>>> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000
>>> ioremap
>>> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
>>> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
>>> 0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
>>>     
>>
>> This is from a 32 bit kernel.
>>
>> This doesn't match your previous /proc/meminfo (from a 64bit kernel on
>> a 12 GB machine)
>>
>> Of course, I would like /proc/vmallocinfo on your loaded router, not from
>> a dev machine :)
>>
>>   
> 
> Yes sorry for no info about it.
> I test the same kernel configurations on one 32bit machine and second 64bit
> 
> here is meminfo from this 32bit machine working on kernel 2.6.30
> cat /proc/meminfo
> MemTotal:        3625444 kB
> MemFree:         3043648 kB
> Buffers:          133968 kB
> Cached:            36316 kB
> SwapCached:            0 kB
> Active:           256868 kB
> Inactive:          76252 kB
> Active(anon):     163064 kB
> Inactive(anon):        0 kB
> Active(file):      93804 kB
> Inactive(file):    76252 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> HighTotal:       2758160 kB
> HighFree:        2556136 kB
> LowTotal:         867284 kB
> LowFree:          487512 kB
> SwapTotal:        995896 kB
> SwapFree:         995896 kB
> Dirty:              3624 kB
> Writeback:             0 kB
> AnonPages:        162912 kB
> Mapped:             3612 kB
> Slab:             235888 kB
> SReclaimable:      46408 kB
> SUnreclaim:       189480 kB
> PageTables:          384 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     2808616 kB
> Committed_AS:     170648 kB
> VmallocTotal:     122880 kB
> VmallocUsed:        2876 kB
> VmallocChunk:     109824 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       4096 kB
> DirectMap4k:        8184 kB
> DirectMap4M:      901120 kB
> and vmallocinfo
> 
> cat /proc/vmallocinfo
> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfe6a000 ioremap
> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
> phys=dfef5000 ioremap
> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef2000 ioremap
> 0xf800c000-0xf800e000    8192
> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfefb000 ioremap
> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef4000 ioremap
> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef3000 ioremap
> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef1000 ioremap
> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfef0000 ioremap
> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeef000 ioremap
> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeee000 ioremap
> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeed000 ioremap
> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=dfeec000 ioremap
> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
> phys=fed1c000 ioremap
> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
> pages=1 vmalloc
> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
> pages=1 vmalloc
> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
> pages=3 vmalloc
> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
> pages=3 vmalloc
> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
> pages=3 vmalloc
> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
> pages=3 vmalloc
> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
> 0xf8bbc000-0xf8cbe000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
> 
> 
> And the next machine, with kernel 2.6.29.3:
> dmesg:
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> Fix inflate_threshold_root. Now=15 size=11 bits
> cat /proc/meminfo
> MemTotal:        2072652 kB
> MemFree:          496960 kB
> Buffers:          267620 kB
> Cached:           895212 kB
> SwapCached:            0 kB
> Active:           675744 kB
> Inactive:         703312 kB
> Active(anon):     215848 kB
> Inactive(anon):        0 kB
> Active(file):     459896 kB
> Inactive(file):   703312 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> HighTotal:       1186696 kB
> HighFree:         151156 kB
> LowTotal:         885956 kB
> LowFree:          345804 kB
> SwapTotal:       1975984 kB
> SwapFree:        1975984 kB
> Dirty:                20 kB
> Writeback:             0 kB
> AnonPages:        215724 kB
> Mapped:             6120 kB
> Slab:             186652 kB
> SReclaimable:     125832 kB
> SUnreclaim:        60820 kB
> PageTables:          416 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:     3012308 kB
> Committed_AS:     223692 kB
> VmallocTotal:     122880 kB
> VmallocUsed:        3192 kB
> VmallocChunk:     108436 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       4096 kB
> DirectMap4k:        8184 kB
> DirectMap4M:      901120 kB
> cat /proc/vmallocinfo
> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=7fee0000 ioremap
> 0xf8000000-0xf8005000   20480 acpi_tb_verify_table+0x1d/0x46
> phys=7fee3000 ioremap
> 0xf8006000-0xf8008000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=7fee3000 ioremap
> 0xf800a000-0xf800c000    8192 acpi_tb_verify_table+0x1d/0x46
> phys=7fee6000 ioremap
> 0xf800d000-0xf800f000    8192 reiserfs_init_bitmap_cache+0x3b/0x80
> pages=1 vmalloc
> 0xf8010000-0xf8022000   73728 journal_init+0x30/0x8f0 pages=17 vmalloc
> 0xf8023000-0xf8025000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
> pages=1 vmalloc
> 0xf8026000-0xf8028000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
> pages=1 vmalloc
> 0xf8029000-0xf802b000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
> pages=1 vmalloc
> 0xf802c000-0xf802e000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
> pages=1 vmalloc
> 0xf802f000-0xf8031000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
> pages=1 vmalloc
> 0xf803e000-0xf8040000    8192 e1000_setup_all_tx_resources+0x57/0x660
> pages=1 vmalloc
> 0xf8040000-0xf8061000  135168 e1000_probe+0x207/0xeb0 phys=f5000000 ioremap
> 0xf8062000-0xf8064000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
> pages=1 vmalloc
> 0xf8065000-0xf8067000    8192 e1000_setup_all_tx_resources+0x57/0x660
> pages=1 vmalloc
> 0xf8068000-0xf806a000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
> pages=1 vmalloc
> 0xf806b000-0xf806d000    8192 e1000_setup_all_tx_resources+0x57/0x660
> pages=1 vmalloc
> 0xf806e000-0xf8070000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
> pages=1 vmalloc
> 0xf8080000-0xf80a1000  135168 e1000_probe+0x207/0xeb0 phys=f1040000 ioremap
> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x207/0xeb0 phys=f4000000 ioremap
> 0xf80e2000-0xf8173000  593920 journal_init+0x56e/0x8f0 pages=144 vmalloc
> 0xf8174000-0xf8267000  995328 sys_swapon+0x548/0xa30 pages=242 vmalloc
> 0xf8d17000-0xf8e19000 1056768 tnode_new+0x7f/0x90 pages=257 vmalloc
> 
> 
> because I have this info from 5 machines that work in an iBGP mesh,
> and only one 64-bit dev machine that is one of the failover members, but I
> killed this machine after upgrading to kernel 2.6.31-rc1

Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.

Even 2.6.30 is too young for a production machine.

2.6.29.5 contains the fixes, Pawel, did you try this version?

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 10:34         ` Eric Dumazet
@ 2009-06-26 10:47           ` Paweł Staszewski
  2009-06-26 10:52             ` Eric Dumazet
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 10:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list

Eric Dumazet wrote:
> Paweł Staszewski wrote:
>
>> Eric Dumazet wrote:
>>
>>> Paweł Staszewski wrote:
>>>  
>>>       
>>>> cat /proc/vmallocinfo
>>>> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfe6a000 ioremap
>>>> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef5000 ioremap
>>>> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef2000 ioremap
>>>> 0xf800c000-0xf800e000    8192
>>>> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
>>>> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfefb000 ioremap
>>>> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef4000 ioremap
>>>> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef3000 ioremap
>>>> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef1000 ioremap
>>>> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfef0000 ioremap
>>>> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfeef000 ioremap
>>>> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfeee000 ioremap
>>>> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfeed000 ioremap
>>>> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
>>>> phys=dfeec000 ioremap
>>>> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
>>>> phys=fed1c000 ioremap
>>>> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000
>>>> ioremap
>>>> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000
>>>> ioremap
>>>> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
>>>> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000
>>>> ioremap
>>>> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
>>>> pages=1 vmalloc
>>>> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
>>>> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>>> pages=1 vmalloc
>>>> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>>> pages=1 vmalloc
>>>> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>>> pages=1 vmalloc
>>>> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>>> pages=1 vmalloc
>>>> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>>>> pages=1 vmalloc
>>>> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000
>>>> ioremap
>>>> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
>>>> pages=3 vmalloc
>>>> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
>>>> pages=3 vmalloc
>>>> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
>>>> pages=3 vmalloc
>>>> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
>>>> pages=3 vmalloc
>>>> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000
>>>> ioremap
>>>> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000
>>>> ioremap
>>>> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
>>>> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
>>>> 0xf846a000-0xf856c000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
>>>>     
>>>>         
>>> This is from a 32 bit kernel.
>>>
>>> This doesn't match your previous /proc/meminfo (from a 64-bit kernel on
>>> a 12 GB machine)
>>>
>>> Of course, I would like /proc/vmallocinfo on your loaded router, not from
>>> a dev machine :)
>>>
>>>   
>>>       
>> Yes, sorry for not mentioning it.
>> I am testing the same kernel configuration on one 32-bit machine and
>> one 64-bit machine.
>>
>> Here is meminfo from the 32-bit machine running kernel 2.6.30:
>> cat /proc/meminfo
>> MemTotal:        3625444 kB
>> MemFree:         3043648 kB
>> Buffers:          133968 kB
>> Cached:            36316 kB
>> SwapCached:            0 kB
>> Active:           256868 kB
>> Inactive:          76252 kB
>> Active(anon):     163064 kB
>> Inactive(anon):        0 kB
>> Active(file):      93804 kB
>> Inactive(file):    76252 kB
>> Unevictable:           0 kB
>> Mlocked:               0 kB
>> HighTotal:       2758160 kB
>> HighFree:        2556136 kB
>> LowTotal:         867284 kB
>> LowFree:          487512 kB
>> SwapTotal:        995896 kB
>> SwapFree:         995896 kB
>> Dirty:              3624 kB
>> Writeback:             0 kB
>> AnonPages:        162912 kB
>> Mapped:             3612 kB
>> Slab:             235888 kB
>> SReclaimable:      46408 kB
>> SUnreclaim:       189480 kB
>> PageTables:          384 kB
>> NFS_Unstable:          0 kB
>> Bounce:                0 kB
>> WritebackTmp:          0 kB
>> CommitLimit:     2808616 kB
>> Committed_AS:     170648 kB
>> VmallocTotal:     122880 kB
>> VmallocUsed:        2876 kB
>> VmallocChunk:     109824 kB
>> HugePages_Total:       0
>> HugePages_Free:        0
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       4096 kB
>> DirectMap4k:        8184 kB
>> DirectMap4M:      901120 kB
>> and vmallocinfo
>>
>> cat /proc/vmallocinfo
>> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfe6a000 ioremap
>> 0xf8000000-0xf8007000   28672 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef5000 ioremap
>> 0xf8008000-0xf800a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef2000 ioremap
>> 0xf800c000-0xf800e000    8192
>> acpi_ex_system_memory_space_handler+0xd6/0x208 phys=fed1f000 ioremap
>> 0xf8010000-0xf8012000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfefb000 ioremap
>> 0xf8014000-0xf8016000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef4000 ioremap
>> 0xf8018000-0xf801a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef3000 ioremap
>> 0xf801c000-0xf801e000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef1000 ioremap
>> 0xf8020000-0xf8022000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfef0000 ioremap
>> 0xf8024000-0xf8026000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeef000 ioremap
>> 0xf8028000-0xf802a000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeee000 ioremap
>> 0xf802c000-0xf802e000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeed000 ioremap
>> 0xf8030000-0xf8032000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=dfeec000 ioremap
>> 0xf8038000-0xf803d000   20480 ich_force_enable_hpet+0x69/0x15a
>> phys=fed1c000 ioremap
>> 0xf803e000-0xf8040000    8192 hpet_enable+0x2a/0x21b phys=fed00000 ioremap
>> 0xf8040000-0xf8046000   24576 alloc_iommu+0x18d/0x1d4 phys=feb00000 ioremap
>> 0xf8048000-0xf804a000    8192 pcim_iomap+0x2f/0x3a phys=e1b21000 ioremap
>> 0xf804c000-0xf804e000    8192 e1000_probe+0x229/0xa73 phys=e1b20000 ioremap
>> 0xf804f000-0xf8051000    8192 reiserfs_init_bitmap_cache+0x32/0x65
>> pages=1 vmalloc
>> 0xf8052000-0xf8064000   73728 journal_init+0x30/0x82a pages=17 vmalloc
>> 0xf8065000-0xf8067000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8068000-0xf806a000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf806b000-0xf806d000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf806e000-0xf8070000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8071000-0xf8073000    8192 reiserfs_allocate_list_bitmaps+0x27/0x7e
>> pages=1 vmalloc
>> 0xf8080000-0xf80a1000  135168 e1000_probe+0x1ca/0xa73 phys=e1b00000 ioremap
>> 0xf80a2000-0xf80a6000   16384 e1000e_setup_rx_resources+0x20/0xf7
>> pages=3 vmalloc
>> 0xf80a7000-0xf80ab000   16384 e1000e_setup_tx_resources+0x17/0x96
>> pages=3 vmalloc
>> 0xf80ac000-0xf80b0000   16384 e1000e_setup_rx_resources+0x20/0xf7
>> pages=3 vmalloc
>> 0xf80b1000-0xf80b5000   16384 e1000e_setup_tx_resources+0x17/0x96
>> pages=3 vmalloc
>> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x1ca/0xa73 phys=e1a60000 ioremap
>> 0xf8100000-0xf8121000  135168 e1000_probe+0x1ca/0xa73 phys=e1a20000 ioremap
>> 0xf8122000-0xf81b3000  593920 journal_init+0x65b/0x82a pages=144 vmalloc
>> 0xf81b4000-0xf822f000  503808 sys_swapon+0x392/0x8f3 pages=122 vmalloc
>> 0xf8bbc000-0xf8cbe000 1056768 tnode_new+0x35/0x65 pages=257 vmalloc
>>
>>
>> And the next machine, with kernel 2.6.29.3:
>> dmesg:
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> Fix inflate_threshold_root. Now=15 size=11 bits
>> cat /proc/meminfo
>> MemTotal:        2072652 kB
>> MemFree:          496960 kB
>> Buffers:          267620 kB
>> Cached:           895212 kB
>> SwapCached:            0 kB
>> Active:           675744 kB
>> Inactive:         703312 kB
>> Active(anon):     215848 kB
>> Inactive(anon):        0 kB
>> Active(file):     459896 kB
>> Inactive(file):   703312 kB
>> Unevictable:           0 kB
>> Mlocked:               0 kB
>> HighTotal:       1186696 kB
>> HighFree:         151156 kB
>> LowTotal:         885956 kB
>> LowFree:          345804 kB
>> SwapTotal:       1975984 kB
>> SwapFree:        1975984 kB
>> Dirty:                20 kB
>> Writeback:             0 kB
>> AnonPages:        215724 kB
>> Mapped:             6120 kB
>> Slab:             186652 kB
>> SReclaimable:     125832 kB
>> SUnreclaim:        60820 kB
>> PageTables:          416 kB
>> NFS_Unstable:          0 kB
>> Bounce:                0 kB
>> WritebackTmp:          0 kB
>> CommitLimit:     3012308 kB
>> Committed_AS:     223692 kB
>> VmallocTotal:     122880 kB
>> VmallocUsed:        3192 kB
>> VmallocChunk:     108436 kB
>> HugePages_Total:       0
>> HugePages_Free:        0
>> HugePages_Rsvd:        0
>> HugePages_Surp:        0
>> Hugepagesize:       4096 kB
>> DirectMap4k:        8184 kB
>> DirectMap4M:      901120 kB
>> cat /proc/vmallocinfo
>> 0xf7ffe000-0xf8000000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=7fee0000 ioremap
>> 0xf8000000-0xf8005000   20480 acpi_tb_verify_table+0x1d/0x46
>> phys=7fee3000 ioremap
>> 0xf8006000-0xf8008000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=7fee3000 ioremap
>> 0xf800a000-0xf800c000    8192 acpi_tb_verify_table+0x1d/0x46
>> phys=7fee6000 ioremap
>> 0xf800d000-0xf800f000    8192 reiserfs_init_bitmap_cache+0x3b/0x80
>> pages=1 vmalloc
>> 0xf8010000-0xf8022000   73728 journal_init+0x30/0x8f0 pages=17 vmalloc
>> 0xf8023000-0xf8025000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
>> pages=1 vmalloc
>> 0xf8026000-0xf8028000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
>> pages=1 vmalloc
>> 0xf8029000-0xf802b000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
>> pages=1 vmalloc
>> 0xf802c000-0xf802e000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
>> pages=1 vmalloc
>> 0xf802f000-0xf8031000    8192 reiserfs_allocate_list_bitmaps+0x2d/0x90
>> pages=1 vmalloc
>> 0xf803e000-0xf8040000    8192 e1000_setup_all_tx_resources+0x57/0x660
>> pages=1 vmalloc
>> 0xf8040000-0xf8061000  135168 e1000_probe+0x207/0xeb0 phys=f5000000 ioremap
>> 0xf8062000-0xf8064000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
>> pages=1 vmalloc
>> 0xf8065000-0xf8067000    8192 e1000_setup_all_tx_resources+0x57/0x660
>> pages=1 vmalloc
>> 0xf8068000-0xf806a000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
>> pages=1 vmalloc
>> 0xf806b000-0xf806d000    8192 e1000_setup_all_tx_resources+0x57/0x660
>> pages=1 vmalloc
>> 0xf806e000-0xf8070000    8192 e1000_setup_all_rx_resources+0x57/0x6d0
>> pages=1 vmalloc
>> 0xf8080000-0xf80a1000  135168 e1000_probe+0x207/0xeb0 phys=f1040000 ioremap
>> 0xf80c0000-0xf80e1000  135168 e1000_probe+0x207/0xeb0 phys=f4000000 ioremap
>> 0xf80e2000-0xf8173000  593920 journal_init+0x56e/0x8f0 pages=144 vmalloc
>> 0xf8174000-0xf8267000  995328 sys_swapon+0x548/0xa30 pages=242 vmalloc
>> 0xf8d17000-0xf8e19000 1056768 tnode_new+0x7f/0x90 pages=257 vmalloc
>>
>>
>> because I have this info from 5 machines that work in an iBGP mesh,
>> and only one 64-bit dev machine that is one of the failover members, but I
>> killed this machine after upgrading to kernel 2.6.31-rc1
>>     
>
> Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.
>
>   
No problem. With this test I lost only one test failover, and no traffic
was lost when the system switched to the primary routers. :)
> Even 2.6.30 is too young for a production machine.
>   
I always do it like this: I have an iBGP mesh with the main access path on
machines running stable 2.6.28.9 kernels, and a second failover path on
machines that run the newest kernel for testing, in this case 2.6.29, but
after some problems I also tried 2.6.30 yesterday.
> 2.6.29.5 contains the fixes, Pawel, did you try this version?
>
>
>   
I will try 2.6.29.5 today.

Thanks
Paweł Staszewski

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 10:47           ` Paweł Staszewski
@ 2009-06-26 10:52             ` Eric Dumazet
  2009-06-26 17:26               ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Eric Dumazet @ 2009-06-26 10:52 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

Paweł Staszewski wrote:
> Eric Dumazet wrote:
     
>>
>> Yes, I was a fool to ask you to try 2.6.31-rc1, sorry.
>>
>>   
> No problem. With this test I lost only one test failover, and no traffic
> was lost when the system switched to the primary routers. :)
>> Even 2.6.30 is too young for a production machine.
>>
> I always do it like this: I have an iBGP mesh with the main access path on
> machines running stable 2.6.28.9 kernels, and a second failover path on
> machines that run the newest kernel for testing, in this case 2.6.29, but
> after some problems I also tried 2.6.30 yesterday.
>> 2.6.29.5 contains the fixes, Pawel, did you try this version?
>>
>>
>>   
> I will try 2.6.29.5 today
> 
OK thanks

Please report the output of

rtstat -c20 -i1

while the machine is under enough load.

(rtstat is a symbolic link to lnstat, in case your distro does not provide it.)


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26  9:37       ` Jarek Poplawski
  2009-06-26 10:26         ` Jorge Boncompte [DTI2]
@ 2009-06-26 12:42         ` Robert Olsson
  2009-06-26 12:54           ` Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 12:42 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list


Jarek Poplawski writes:

 > >  But maybe memory allocation experts have some good suggestions.
 > 
 > Pawel has reported these problems for a long time:
 > http://bugzilla.kernel.org/show_bug.cgi?id=6648
 > 
 > So, until it's fully investigated, it seems some 'fast' fix is needed
 > here.

 We talked long ago about having a fixed, pre-allocated root node, but it's
 only an optimisation for routers with full BGP. It would be best if the
 memory problems got solved.
 
 Cheers
						--ro
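
For a sense of scale, the size of such a root node can be estimated from
numbers already posted in this thread. A rough userspace sketch, assuming
4-byte child pointers (32-bit kernel), 4096-byte pages, one guard page per
vmalloc area, and a guessed header size; every sketch_ name and the
32-byte header are illustrative, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/* Bytes needed for a tnode with 2^bits child pointers plus a header. */
static size_t sketch_tnode_bytes(unsigned bits, size_t ptr_bytes,
				 size_t hdr_bytes)
{
	return hdr_bytes + (ptr_bytes << bits);
}

/* Pages needed to back that many bytes, rounded up. */
static size_t sketch_pages(size_t bytes)
{
	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Size of the vmalloc area, assuming one extra guard page. */
static size_t sketch_vmalloc_area(size_t bytes)
{
	return (sketch_pages(bytes) + 1) * SKETCH_PAGE_SIZE;
}

/* Compare with the tnode_new line in the /proc/vmallocinfo dumps. */
static void sketch_check_thread_numbers(void)
{
	size_t bytes = sketch_tnode_bytes(18, 4, 32);	/* header size is a guess */

	assert(sketch_pages(bytes) == 257);
	assert(sketch_vmalloc_area(bytes) == 1056768);
}
```

With bits = 18 (fib_triestat at the start of the thread lists a single
18-bit node) and any header smaller than one page, this gives 257 pages and
a 1056768-byte area, which is what the tnode_new entries in the posted
/proc/vmallocinfo dumps show.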

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 12:42         ` Robert Olsson
@ 2009-06-26 12:54           ` Jarek Poplawski
  2009-06-26 13:28             ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 12:54 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 02:42:12PM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  > >  But maybe memory allocation experts have some good suggestions.
>  > 
>  > Pawel has reported these problems for a long time:
>  > http://bugzilla.kernel.org/show_bug.cgi?id=6648
>  > 
>  > So, until it's fully investigated, it seems some 'fast' fix is needed
>  > here.
> 
>  We talked long ago about having a fixed, pre-allocated root node, but it's
>  only an optimisation for routers with full BGP. It would be best if the
>  memory problems got solved.
>  

I think the current rebalancing process can allocate a lot of 'temp'
memory and hold it unnecessarily long, so something like the patch
below could probably be useful. It should be applied to 2.6.30 after
the two patches below (from 2.6.31-rc). (Alas, I can't even
compile-test it now.)

Cheers,
Jarek P.

--- (for testing)

 net/ipv4/fib_trie.c |   24 ++++++++++++++++++------
 1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..c2fc862 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1020,19 +1028,23 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 		tnode_put_child_reorg((struct tnode *)tp, cindex,
 				      (struct node *)tn, wasfull);
 
-		tp = node_parent((struct node *) tn);
+		synchronize_rcu();
 		tnode_free_flush();
+		tp = node_parent((struct node *) tn);
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }
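
Note that __tnode_free() in the test patch above checks `size`, which that
function never declares, so it would not build as posted (the author says it
was not compile-tested). A minimal userspace sketch of one possible fix,
assuming the size is derived from the node's `bits` field as header plus
2^bits child pointers; struct sketch_tnode, SKETCH_PAGE_SIZE and the free()
calls standing in for kfree()/vfree() are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

struct sketch_tnode {
	unsigned char bits;	/* node has 2^bits child slots */
	void *child[];		/* flexible array of children */
};

/* Total allocation size: header plus 2^bits child pointers. */
static size_t sketch_tnode_size(const struct sketch_tnode *tn)
{
	return sizeof(*tn) + (sizeof(void *) << tn->bits);
}

/* Returns 1 if the node would take the kfree() path, 0 for vfree(). */
static int sketch_uses_kfree(const struct sketch_tnode *tn)
{
	return sketch_tnode_size(tn) <= SKETCH_PAGE_SIZE;
}

static struct sketch_tnode *sketch_tnode_new(unsigned char bits)
{
	struct sketch_tnode *tn;
	size_t size = sizeof(*tn) + (sizeof(void *) << bits);

	tn = calloc(1, size);
	if (tn)
		tn->bits = bits;
	return tn;
}

/* Free immediately; the caller must already have waited for readers. */
static void sketch_tnode_free(struct sketch_tnode *tn)
{
	size_t size;

	assert(tn != NULL);
	size = sketch_tnode_size(tn);	/* the piece missing in the patch */

	if (size <= SKETCH_PAGE_SIZE)
		free(tn);	/* stands in for kfree() */
	else
		free(tn);	/* stands in for vfree() */
}
```

Freeing directly like this is only safe after the grace period has
elapsed, which is why the patch pairs it with synchronize_rcu() before
tnode_free_flush().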

---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing
    
    While doing trie_rebalance(): resize(), inflate(), halve() RCU free
    tnodes before updating their parents. It depends on RCU delaying the
    real destruction, but if RCU readers start after call_rcu() and before
    parent update they could access freed memory.
    
    It is currently prevented with preempt_disable() on the update side,
    but it's not safe, except maybe classic RCU, plus it conflicts with
    memory allocations with GFP_KERNEL flag used from these functions.
    
    This patch explicitly delays freeing of tnodes by adding them to the
    list, which is flushed after the update is finished.
    
    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 538d2a9..d1a39b1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n,
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn)
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+
+	if (node_parent((struct node *) tn)) {
+		tn->tnode_free = tnode_free_head;
+		tnode_free_head = tn;
+	} else {
+		tnode_free(tn);
+	}
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +696,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn)
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	t_key cindex, key;
 	struct tnode *tp;
 
-	preempt_disable();
 	key = tn->key;
 
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
@@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+		tnode_free_flush();
+	}
 
-	preempt_enable();
 	return (struct node *)tn;
 }
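
The deferred-free list added by this commit can be illustrated outside the
kernel. A minimal single-threaded sketch, assuming plain free() in place of
the RCU-deferred tnode_free(); all sketch_ names are made up for
illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct sketch_node {
	int id;
	struct sketch_node *free_next;	/* used only while queued for freeing */
};

/* Nodes waiting to be freed; the kernel list is protected by RTNL. */
static struct sketch_node *sketch_free_head;

/* Queue a node instead of freeing it while the update is in progress. */
static void sketch_free_safe(struct sketch_node *n)
{
	assert(n != NULL);
	n->free_next = sketch_free_head;
	sketch_free_head = n;
}

/*
 * Free everything queued; call only after the parent pointers have been
 * updated. Returns the number of nodes released.
 */
static unsigned sketch_free_flush(void)
{
	struct sketch_node *n;
	unsigned freed = 0;

	while ((n = sketch_free_head)) {
		sketch_free_head = n->free_next;
		n->free_next = NULL;
		free(n);	/* stands in for tnode_free() */
		freed++;
	}
	return freed;
}
```

The point of the split is ordering: nodes are only queued while resize(),
inflate() and halve() run, and the flush happens once the new subtree has
been linked into its parent.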
 
---
commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2
    
    My previous patch, which explicitly delays freeing of tnodes by adding
    them to the list to flush them after the update is finished, isn't
    strict enough. It treats exceptionally tnodes without parent, assuming
    they are newly created, so "invisible" for the read side yet.
    
    But the top tnode doesn't have parent as well, so we have to exclude
    all exceptions (at least until a better way is found). Additionally we
    need to move rcu assignment of this node before flushing, so the
    return type of the trie_rebalance() function is changed.
    
    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d1a39b1..012cf5a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn)
 static void tnode_free_safe(struct tnode *tn)
 {
 	BUG_ON(IS_LEAF(tn));
-
-	if (node_parent((struct node *) tn)) {
-		tn->tnode_free = tnode_free_head;
-		tnode_free_head = tn;
-	} else {
-		tnode_free(tn);
-	}
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
 }
 
 static void tnode_free_flush(void)
@@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
@@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn)) {
+	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-		tnode_free_flush();
-	}
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l)
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 12:54           ` Jarek Poplawski
@ 2009-06-26 13:28             ` Jarek Poplawski
  2009-06-26 13:52               ` Robert Olsson
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 13:28 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 12:54:49PM +0000, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 02:42:12PM +0200, Robert Olsson wrote:
> > 
> > Jarek Poplawski writes:
> > 
> >  > >  But maybe memory allocation experts have some good suggestions.
> >  > 
> >  > Pawel has reported these problems for a long time:
> >  > http://bugzilla.kernel.org/show_bug.cgi?id=6648
> >  > 
> >  > So, until it's fully investigated, it seems some 'fast' fix is needed
> >  > here.
> > 
> >  We talked long ago about having a fixed, pre-allocated root node, but it's
> >  only an optimisation for routers with full BGP. It would be best if the
> >  memory problems got solved.
> >  
> 
> I think the current process of rebalancing can allocate and hold
> unnecessarily long a lot of 'temp' memory, so probably something
> like the patch below could be useful. It should be applied to the
> 2.6.30 after two patches below (from 2.6.31-rc). (Alas I can't even
> compile-test it now).
> 

Alternatively, here is a faster version with fewer synchronize_rcu() calls.

Jarek P.
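
For readers following along: both this patch and the two 2.6.31-rc commits appended below build on a deferred-free list (tnode_free_safe() queues a tnode, tnode_free_flush() releases the queue). What follows is only a minimal userspace sketch of that list, not the kernel code; the demo_* names, the counter and the use of malloc()/free() are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode: only the free-list link. */
struct demo_tnode {
	struct demo_tnode *tnode_free;	/* next node awaiting release */
};

/* Head of the pending list; the kernel version is protected by RTNL. */
static struct demo_tnode *demo_free_head;

/* Counts releases so that the behaviour can be checked. */
static int demo_freed;

/* Queue a node instead of freeing it at once (cf. tnode_free_safe()). */
static void demo_free_safe(struct demo_tnode *tn)
{
	tn->tnode_free = demo_free_head;
	demo_free_head = tn;
}

/* Release everything queued so far (cf. tnode_free_flush()). */
static void demo_free_flush(void)
{
	struct demo_tnode *tn;

	while ((tn = demo_free_head)) {
		demo_free_head = tn->tnode_free;
		tn->tnode_free = NULL;
		free(tn);	/* stand-in for the kernel's real release path */
		demo_freed++;
	}
}

/* Allocate and queue n nodes; returns how many were queued. */
static int demo_queue_nodes(int n)
{
	int i, queued = 0;

	for (i = 0; i < n; i++) {
		struct demo_tnode *tn = malloc(sizeof(*tn));

		if (!tn)
			break;
		demo_free_safe(tn);
		queued++;
	}
	return queued;
}
```

Queueing three nodes with demo_queue_nodes(3) releases nothing; a single demo_free_flush() afterwards empties the list and leaves demo_freed at 3.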

--- (take 2 - for testing)

 net/ipv4/fib_trie.c |   27 +++++++++++++++++++++------
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..2936b2e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }



---
commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing
    
    While doing trie_rebalance(): resize(), inflate(), halve() RCU free
    tnodes before updating their parents. It depends on RCU delaying the
    real destruction, but if RCU readers start after call_rcu() and before
    parent update they could access freed memory.
    
    It is currently prevented with preempt_disable() on the update side,
    but it's not safe, except maybe classic RCU, plus it conflicts with
    memory allocations with GFP_KERNEL flag used from these functions.
    
    This patch explicitly delays freeing of tnodes by adding them to the
    list, which is flushed after the update is finished.
    
    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 538d2a9..d1a39b1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n,
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn)
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+
+	if (node_parent((struct node *) tn)) {
+		tn->tnode_free = tnode_free_head;
+		tnode_free_head = tn;
+	} else {
+		tnode_free(tn);
+	}
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +696,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn)
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	t_key cindex, key;
 	struct tnode *tp;
 
-	preempt_disable();
 	key = tn->key;
 
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
@@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+		tnode_free_flush();
+	}
 
-	preempt_enable();
 	return (struct node *)tn;
 }
 
---
commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2
    
    My previous patch, which explicitly delays freeing of tnodes by adding
    them to the list to flush them after the update is finished, isn't
    strict enough. It treats exceptionally tnodes without parent, assuming
    they are newly created, so "invisible" for the read side yet.
    
    But the top tnode doesn't have parent as well, so we have to exclude
    all exceptions (at least until a better way is found). Additionally we
    need to move rcu assignment of this node before flushing, so the
    return type of the trie_rebalance() function is changed.
    
    Reported-by: Yan Zheng <zheng.yan@oracle.com>
    Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d1a39b1..012cf5a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn)
 static void tnode_free_safe(struct tnode *tn)
 {
 	BUG_ON(IS_LEAF(tn));
-
-	if (node_parent((struct node *) tn)) {
-		tn->tnode_free = tnode_free_head;
-		tnode_free_head = tn;
-	} else {
-		tnode_free(tn);
-	}
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
 }
 
 static void tnode_free_flush(void)
@@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
@@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn)) {
+	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-		tnode_free_flush();
-	}
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l)
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 13:28             ` Jarek Poplawski
@ 2009-06-26 13:52               ` Robert Olsson
  2009-06-26 15:10                 ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 13:52 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list


Jarek Poplawski writes:

 Thanks, 

 Should be worth testing, so that we use synchronize_rcu() instead of call_rcu().
 
 Cheers
					--ro


 > Alternatively, here is a faster version with fewer synchronize_rcu() calls.
 > 
 > Jarek P.
 > 
 > --- (take 2 - for testing)
 > 
 >  net/ipv4/fib_trie.c |   27 +++++++++++++++++++++------
 >  1 files changed, 21 insertions(+), 6 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..2936b2e 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -366,6 +366,14 @@ static void __tnode_vfree(struct work_struct *arg)
 >  	vfree(tn);
 >  }
 >  
 > +static void __tnode_free(struct tnode *tn)
 > +{
 > +	if (size <= PAGE_SIZE)
 > +		kfree(tn);
 > +	else
 > +		vfree(tn);
 > +}
 > +
 >  static void __tnode_free_rcu(struct rcu_head *head)
 >  {
 >  	struct tnode *tn = container_of(head, struct tnode, rcu);
 > @@ -402,7 +410,7 @@ static void tnode_free_flush(void)
 >  	while ((tn = tnode_free_head)) {
 >  		tnode_free_head = tn->tnode_free;
 >  		tn->tnode_free = NULL;
 > -		tnode_free(tn);
 > +		__tnode_free(tn);
 >  	}
 >  }
 >  
 > @@ -1021,18 +1029,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  				      (struct node *)tn, wasfull);
 >  
 >  		tp = node_parent((struct node *) tn);
 > -		tnode_free_flush();
 >  		if (!tp)
 >  			break;
 >  		tn = tp;
 >  	}
 >  
 > +	if (tnode_free_head) {
 > +		synchronize_rcu();
 > +		tnode_free_flush();
 > +	}
 > +
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn))
 > +	if (IS_TNODE(tn)) {
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > -
 > -	rcu_assign_pointer(t->trie, (struct node *)tn);
 > -	tnode_free_flush();
 > +		rcu_assign_pointer(t->trie, (struct node *)tn);
 > +		synchronize_rcu();
 > +		tnode_free_flush();
 > +	} else {
 > +		rcu_assign_pointer(t->trie, (struct node *)tn);
 > +	}
 >  
 >  	return;
 >  }
 > 
 > 
 > 
 > ---
 > commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
 > Author: Jarek Poplawski <jarkao2@gmail.com>
 > Date:   Mon Jun 15 02:31:29 2009 -0700
 > 
 >     ipv4: Fix fib_trie rebalancing
 >     
 >     While doing trie_rebalance(): resize(), inflate(), halve() RCU free
 >     tnodes before updating their parents. It depends on RCU delaying the
 >     real destruction, but if RCU readers start after call_rcu() and before
 >     parent update they could access freed memory.
 >     
 >     It is currently prevented with preempt_disable() on the update side,
 >     but it's not safe, except maybe classic RCU, plus it conflicts with
 >     memory allocations with GFP_KERNEL flag used from these functions.
 >     
 >     This patch explicitly delays freeing of tnodes by adding them to the
 >     list, which is flushed after the update is finished.
 >     
 >     Reported-by: Yan Zheng <zheng.yan@oracle.com>
 >     Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
 >     Signed-off-by: David S. Miller <davem@davemloft.net>
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 538d2a9..d1a39b1 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -123,6 +123,7 @@ struct tnode {
 >  	union {
 >  		struct rcu_head rcu;
 >  		struct work_struct work;
 > +		struct tnode *tnode_free;
 >  	};
 >  	struct node *child[0];
 >  };
 > @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n,
 >  static struct node *resize(struct trie *t, struct tnode *tn);
 >  static struct tnode *inflate(struct trie *t, struct tnode *tn);
 >  static struct tnode *halve(struct trie *t, struct tnode *tn);
 > +/* tnodes to free after resize(); protected by RTNL */
 > +static struct tnode *tnode_free_head;
 >  
 >  static struct kmem_cache *fn_alias_kmem __read_mostly;
 >  static struct kmem_cache *trie_leaf_kmem __read_mostly;
 > @@ -385,6 +388,29 @@ static inline void tnode_free(struct tnode *tn)
 >  		call_rcu(&tn->rcu, __tnode_free_rcu);
 >  }
 >  
 > +static void tnode_free_safe(struct tnode *tn)
 > +{
 > +	BUG_ON(IS_LEAF(tn));
 > +
 > +	if (node_parent((struct node *) tn)) {
 > +		tn->tnode_free = tnode_free_head;
 > +		tnode_free_head = tn;
 > +	} else {
 > +		tnode_free(tn);
 > +	}
 > +}
 > +
 > +static void tnode_free_flush(void)
 > +{
 > +	struct tnode *tn;
 > +
 > +	while ((tn = tnode_free_head)) {
 > +		tnode_free_head = tn->tnode_free;
 > +		tn->tnode_free = NULL;
 > +		tnode_free(tn);
 > +	}
 > +}
 > +
 >  static struct leaf *leaf_new(void)
 >  {
 >  	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
 > @@ -495,7 +521,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  
 >  	/* No children */
 >  	if (tn->empty_children == tnode_child_length(tn)) {
 > -		tnode_free(tn);
 > +		tnode_free_safe(tn);
 >  		return NULL;
 >  	}
 >  	/* One child */
 > @@ -509,7 +535,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  
 >  			/* compress one level */
 >  			node_set_parent(n, NULL);
 > -			tnode_free(tn);
 > +			tnode_free_safe(tn);
 >  			return n;
 >  		}
 >  	/*
 > @@ -670,7 +696,7 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  			/* compress one level */
 >  
 >  			node_set_parent(n, NULL);
 > -			tnode_free(tn);
 > +			tnode_free_safe(tn);
 >  			return n;
 >  		}
 >  
 > @@ -756,7 +782,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 >  			put_child(t, tn, 2*i, inode->child[0]);
 >  			put_child(t, tn, 2*i+1, inode->child[1]);
 >  
 > -			tnode_free(inode);
 > +			tnode_free_safe(inode);
 >  			continue;
 >  		}
 >  
 > @@ -801,9 +827,9 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn)
 >  		put_child(t, tn, 2*i, resize(t, left));
 >  		put_child(t, tn, 2*i+1, resize(t, right));
 >  
 > -		tnode_free(inode);
 > +		tnode_free_safe(inode);
 >  	}
 > -	tnode_free(oldtnode);
 > +	tnode_free_safe(oldtnode);
 >  	return tn;
 >  nomem:
 >  	{
 > @@ -885,7 +911,7 @@ static struct tnode *halve(struct trie *t, struct tnode *tn)
 >  		put_child(t, newBinNode, 1, right);
 >  		put_child(t, tn, i/2, resize(t, newBinNode));
 >  	}
 > -	tnode_free(oldtnode);
 > +	tnode_free_safe(oldtnode);
 >  	return tn;
 >  nomem:
 >  	{
 > @@ -989,7 +1015,6 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 >  	t_key cindex, key;
 >  	struct tnode *tp;
 >  
 > -	preempt_disable();
 >  	key = tn->key;
 >  
 >  	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 > @@ -1001,16 +1026,18 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 >  				      (struct node *)tn, wasfull);
 >  
 >  		tp = node_parent((struct node *) tn);
 > +		tnode_free_flush();
 >  		if (!tp)
 >  			break;
 >  		tn = tp;
 >  	}
 >  
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn))
 > +	if (IS_TNODE(tn)) {
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > +		tnode_free_flush();
 > +	}
 >  
 > -	preempt_enable();
 >  	return (struct node *)tn;
 >  }
 >  
 > ---
 > commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
 > Author: Jarek Poplawski <jarkao2@gmail.com>
 > Date:   Thu Jun 18 00:28:51 2009 -0700
 > 
 >     ipv4: Fix fib_trie rebalancing, part 2
 >     
 >     My previous patch, which explicitly delays freeing of tnodes by adding
 >     them to the list to flush them after the update is finished, isn't
 >     strict enough. It treats exceptionally tnodes without parent, assuming
 >     they are newly created, so "invisible" for the read side yet.
 >     
 >     But the top tnode doesn't have parent as well, so we have to exclude
 >     all exceptions (at least until a better way is found). Additionally we
 >     need to move rcu assignment of this node before flushing, so the
 >     return type of the trie_rebalance() function is changed.
 >     
 >     Reported-by: Yan Zheng <zheng.yan@oracle.com>
 >     Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
 >     Signed-off-by: David S. Miller <davem@davemloft.net>
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index d1a39b1..012cf5a 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -391,13 +391,8 @@ static inline void tnode_free(struct tnode *tn)
 >  static void tnode_free_safe(struct tnode *tn)
 >  {
 >  	BUG_ON(IS_LEAF(tn));
 > -
 > -	if (node_parent((struct node *) tn)) {
 > -		tn->tnode_free = tnode_free_head;
 > -		tnode_free_head = tn;
 > -	} else {
 > -		tnode_free(tn);
 > -	}
 > +	tn->tnode_free = tnode_free_head;
 > +	tnode_free_head = tn;
 >  }
 >  
 >  static void tnode_free_flush(void)
 > @@ -1009,7 +1004,7 @@ fib_find_node(struct trie *t, u32 key)
 >  	return NULL;
 >  }
 >  
 > -static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 > +static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  {
 >  	int wasfull;
 >  	t_key cindex, key;
 > @@ -1033,12 +1028,13 @@ static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn)) {
 > +	if (IS_TNODE(tn))
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > -		tnode_free_flush();
 > -	}
 >  
 > -	return (struct node *)tn;
 > +	rcu_assign_pointer(t->trie, (struct node *)tn);
 > +	tnode_free_flush();
 > +
 > +	return;
 >  }
 >  
 >  /* only used from updater-side */
 > @@ -1186,7 +1182,7 @@ static struct list_head *fib_insert_node(struct trie *t, u32 key, int plen)
 >  
 >  	/* Rebalance the trie */
 >  
 > -	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
 > +	trie_rebalance(t, tp);
 >  done:
 >  	return fa_head;
 >  }
 > @@ -1605,7 +1601,7 @@ static void trie_leaf_remove(struct trie *t, struct leaf *l)
 >  	if (tp) {
 >  		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 >  		put_child(t, (struct tnode *)tp, cindex, NULL);
 > -		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
 > +		trie_rebalance(t, tp);
 >  	} else
 >  		rcu_assign_pointer(t->trie, NULL);
 >  
 > --
 > To unsubscribe from this list: send the line "unsubscribe netdev" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 13:52               ` Robert Olsson
@ 2009-06-26 15:10                 ` Jarek Poplawski
  2009-06-26 15:30                   ` Paul E. McKenney
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 15:10 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Robert Olsson, Eric Dumazet, Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  Thanks, 
> 
>  Should be worth testing, so that we use synchronize_rcu() instead of call_rcu().
>  

Alas, neither take 2 nor take 1 compiles, so here it is again.
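
The failure in take 2 is easy to spot in its __tnode_free(): it tests size without ever declaring it. Take 3 computes size from tn->bits and then frees with kfree() up to PAGE_SIZE and vfree() above it. As a rough worked example of that formula (a userspace sketch, not kernel code; the 56-byte tnode header is taken from the fib_triestat output at the start of this thread, while 8-byte pointers and 4096-byte pages are assumptions about the machine):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values for illustration only: the header size comes from the
 * fib_triestat output in this thread; pointer and page size are guesses
 * for a typical 64-bit machine. */
#define DEMO_TNODE_HDR	56u
#define DEMO_PTR_BYTES	8u
#define DEMO_PAGE_SIZE	4096u

/* Same formula as the patch: header plus 2^bits child pointers. */
static size_t demo_tnode_size(unsigned int bits)
{
	return DEMO_TNODE_HDR + ((size_t)DEMO_PTR_BYTES << bits);
}

/* True when the node is larger than one page, i.e. the vfree() branch. */
static bool demo_needs_vfree(unsigned int bits)
{
	return demo_tnode_size(bits) > DEMO_PAGE_SIZE;
}
```

With these assumed values a tnode with bits=8 takes 2104 bytes and stays below the page size, bits=9 takes 4152 bytes and crosses it, and an 18-bit node such as the one listed in the triestat output takes 2097208 bytes, i.e. about 2 MB.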

Thanks,
Jarek P.
--- (take 3 - for testing)

 net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
 1 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1a4c4b7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 15:10                 ` Jarek Poplawski
@ 2009-06-26 15:30                   ` Paul E. McKenney
  2009-06-26 15:54                     ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 15:30 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > 
> > Jarek Poplawski writes:
> > 
> >  Thanks, 
> > 
> >  Should be worth testing, so that we use synchronize_rcu() instead of call_rcu().
> >  
> 
> Alas, neither take 2 nor take 1 compiles, so here it is again.

So the idea is to balance memory and latency, so that large changes
(those affecting the root node) get at least one synchronize_rcu(),
while smaller changes just use call_rcu(), correct?  This means that
the amount of memory awaiting an RCU grace period is limited, but
the algorithm avoids per-node synchronize_rcu() overhead.

If I understand the goal correctly, looks good!  (Give or take my
limited understanding of fib_trie and its usage, of course.)
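
To make the trade-off concrete, here is a toy cost model (a sketch under assumptions, not a measurement). It counts grace-period waits and the peak number of bytes still awaiting a grace period for two hypothetical strategies: one synchronize_rcu() per retired node versus one for the whole batch. All demo_* names and the sample sizes are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Result of the toy model: how often the updater waits, and how much
 * memory is queued at worst while it waits. */
struct demo_cost {
	unsigned int waits;	/* modelled synchronize_rcu() calls */
	size_t peak_pending;	/* most bytes awaiting a grace period */
};

/* Strategy A: wait for a grace period after retiring each node. */
static struct demo_cost demo_per_node(const size_t *sizes, size_t n)
{
	struct demo_cost c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		c.waits++;
		if (sizes[i] > c.peak_pending)
			c.peak_pending = sizes[i];
	}
	return c;
}

/* Strategy B: queue every retired node, then wait once and free all. */
static struct demo_cost demo_batched(const size_t *sizes, size_t n)
{
	struct demo_cost c = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++)
		c.peak_pending += sizes[i];
	if (n)
		c.waits = 1;
	return c;
}
```

For three retired nodes of 64, 128 and 4152 bytes, strategy A waits three times with at most 4152 bytes pending, while strategy B waits once with 4344 bytes pending. Take 3 sits between the two: it waits at most twice per rebalance, once before the top node is resized and once after.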

							Thanx, Paul

> Thanks,
> Jarek P.
> --- (take 3 - for testing)
> 
>  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
>  1 files changed, 24 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1a4c4b7 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
>  	vfree(tn);
>  }
> 
> +static void __tnode_free(struct tnode *tn)
> +{
> +	size_t size = sizeof(struct tnode) +
> +		      (sizeof(struct node *) << tn->bits);
> +
> +	if (size <= PAGE_SIZE)
> +		kfree(tn);
> +	else
> +		vfree(tn);
> +}
> +
>  static void __tnode_free_rcu(struct rcu_head *head)
>  {
>  	struct tnode *tn = container_of(head, struct tnode, rcu);
> @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
>  	while ((tn = tnode_free_head)) {
>  		tnode_free_head = tn->tnode_free;
>  		tn->tnode_free = NULL;
> -		tnode_free(tn);
> +		__tnode_free(tn);
>  	}
>  }
> 
> @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  				      (struct node *)tn, wasfull);
> 
>  		tp = node_parent((struct node *) tn);
> -		tnode_free_flush();
>  		if (!tp)
>  			break;
>  		tn = tp;
>  	}
> 
> +	if (tnode_free_head) {
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	}
> +
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> -
> -	rcu_assign_pointer(t->trie, (struct node *)tn);
> -	tnode_free_flush();
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	} else {
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +	}
> 
>  	return;
>  }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 15:30                   ` Paul E. McKenney
@ 2009-06-26 15:54                     ` Jarek Poplawski
  2009-06-26 16:15                       ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 15:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > 
> > > Jarek Poplawski writes:
> > > 
> > >  Thanks, 
> > > 
> > >  Should be worth testing, so that we use synchronize_rcu() instead of call_rcu().
> > >  
> > 
> > Alas, neither take 2 nor take 1 compiles, so here it is again.
> 
> So the idea is to balance memory and latency, so that large changes
> (those affecting the root node) get at least one synchronize_rcu(),
> while smaller changes just use call_rcu(), correct?  This means that
> the amount of memory awaiting an RCU grace period is limited, but
> the algorithm avoids per-node synchronize_rcu() overhead.
> 
> If I understand the goal correctly, looks good!  (Give or take my
> limited understanding of fib_trie and its usage, of course.)

The goal is, in practice, to replace every call_rcu() done during
trie_rebalance() with synchronize_rcu() (except for some freeing after
ENOMEM). I guess that currently (<= 2.6.30) call_rcu() can free this
memory only after trie_rebalance() has finished, which is why there
were problems with preemption enabled. So this patch tries to force
the freeing a bit earlier: at least before the top (largest) node is
rebalanced.
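
Put differently, the tail of trie_rebalance() in take 3 performs its steps in a fixed order. The sketch below only records that order as letters (S = synchronize_rcu(), F = tnode_free_flush(), R = resize() of the top node, P = rcu_assign_pointer() of t->trie). It is a userspace illustration with hypothetical demo_* names, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Room for the longest trace ("SFRPSF") plus the terminating NUL. */
#define DEMO_TRACE_MAX 8

/* Append one event letter to a NUL-terminated trace, if there is room. */
static void demo_trace(char *trace, char ev)
{
	size_t len = strlen(trace);

	if (len + 1 < DEMO_TRACE_MAX) {
		trace[len] = ev;
		trace[len + 1] = '\0';
	}
}

/*
 * Mirror the order of operations after the rebalance loop in take 3.
 * pending:       tnodes were queued on the free list during the loop
 * root_is_tnode: the top node is a tnode (otherwise it is a leaf)
 * trace:         caller buffer of at least DEMO_TRACE_MAX bytes
 */
static void demo_rebalance_tail(bool pending, bool root_is_tnode, char *trace)
{
	trace[0] = '\0';

	if (pending) {
		demo_trace(trace, 'S');	/* wait for readers of old nodes */
		demo_trace(trace, 'F');	/* then free them directly */
	}

	if (root_is_tnode) {
		demo_trace(trace, 'R');	/* resize the top node */
		demo_trace(trace, 'P');	/* publish the new top node */
		demo_trace(trace, 'S');
		demo_trace(trace, 'F');
	} else {
		demo_trace(trace, 'P');
	}
}
```

With nodes pending and a tnode at the top, the recorded order is SFRPSF; with a leaf at the top and nothing pending, it is just P.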

Thanks,
Jarek P.

> 
> 							Thanx, Paul
> 
> > Thanks,
> > Jarek P.
> > --- (take 3 - for testing)
> > 
> >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> >  1 files changed, 24 insertions(+), 6 deletions(-)
> > 
> > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > index 012cf5a..1a4c4b7 100644
> > --- a/net/ipv4/fib_trie.c
> > +++ b/net/ipv4/fib_trie.c
> > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> >  	vfree(tn);
> >  }
> > 
> > +static void __tnode_free(struct tnode *tn)
> > +{
> > +	size_t size = sizeof(struct tnode) +
> > +		      (sizeof(struct node *) << tn->bits);
> > +
> > +	if (size <= PAGE_SIZE)
> > +		kfree(tn);
> > +	else
> > +		vfree(tn);
> > +}
> > +
> >  static void __tnode_free_rcu(struct rcu_head *head)
> >  {
> >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> >  	while ((tn = tnode_free_head)) {
> >  		tnode_free_head = tn->tnode_free;
> >  		tn->tnode_free = NULL;
> > -		tnode_free(tn);
> > +		__tnode_free(tn);
> >  	}
> >  }
> > 
> > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> >  				      (struct node *)tn, wasfull);
> > 
> >  		tp = node_parent((struct node *) tn);
> > -		tnode_free_flush();
> >  		if (!tp)
> >  			break;
> >  		tn = tp;
> >  	}
> > 
> > +	if (tnode_free_head) {
> > +		synchronize_rcu();
> > +		tnode_free_flush();
> > +	}
> > +
> >  	/* Handle last (top) tnode */
> > -	if (IS_TNODE(tn))
> > +	if (IS_TNODE(tn)) {
> >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > -
> > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > -	tnode_free_flush();
> > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > +		synchronize_rcu();
> > +		tnode_free_flush();
> > +	} else {
> > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > +	}
> > 
> >  	return;
> >  }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 15:54                     ` Jarek Poplawski
@ 2009-06-26 16:15                       ` Jarek Poplawski
  2009-06-26 16:23                         ` Paul E. McKenney
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 16:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > 
> > > > Jarek Poplawski writes:
> > > > 
> > > >  Thanks, 
> > > > 
> > > >  Should be worth testing, so that we use synchronize_rcu() instead of call_rcu().
> > > >  
> > > 
> > > Alas, neither take 2 nor take 1 compiles, so here it is again.
> > 
> > So the idea is to balance memory and latency, so that large changes
> > (those affecting the root node) get at least one synchronize_rcu(),
> > while smaller changes just use call_rcu(), correct?  This means that
> > the amount of memory awaiting an RCU grace period is limited, but
> > the algorithm avoids per-node synchronize_rcu() overhead.
> > 
> > If I understand the goal correctly, looks good!  (Give or take my
> > limited understanding of fib_trie and its usage, of course.)
> 
> The goal is, in practice, to replace every call_rcu() done during
> trie_rebalance() with synchronize_rcu() (except for some freeing after
> ENOMEM). I guess that currently (<= 2.6.30) call_rcu() can free this
> memory only after trie_rebalance() has finished, which is why there
> were problems with preemption enabled. So this patch tries to force
> the freeing a bit earlier: at least before the top (largest) node is
> rebalanced.

On the other hand, we could probably stay with call_rcu() plus only
one synchronize_rcu() before the top node's resize() if you think it's
enough here?

Thanks,
Jarek P.

> 
> > 
> > 							Thanx, Paul
> > 
> > > Thanks,
> > > Jarek P.
> > > --- (take 3 - for testing)
> > > 
> > >  net/ipv4/fib_trie.c |   30 ++++++++++++++++++++++++------
> > >  1 files changed, 24 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > > index 012cf5a..1a4c4b7 100644
> > > --- a/net/ipv4/fib_trie.c
> > > +++ b/net/ipv4/fib_trie.c
> > > @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_struct *arg)
> > >  	vfree(tn);
> > >  }
> > > 
> > > +static void __tnode_free(struct tnode *tn)
> > > +{
> > > +	size_t size = sizeof(struct tnode) +
> > > +		      (sizeof(struct node *) << tn->bits);
> > > +
> > > +	if (size <= PAGE_SIZE)
> > > +		kfree(tn);
> > > +	else
> > > +		vfree(tn);
> > > +}
> > > +
> > >  static void __tnode_free_rcu(struct rcu_head *head)
> > >  {
> > >  	struct tnode *tn = container_of(head, struct tnode, rcu);
> > > @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
> > >  	while ((tn = tnode_free_head)) {
> > >  		tnode_free_head = tn->tnode_free;
> > >  		tn->tnode_free = NULL;
> > > -		tnode_free(tn);
> > > +		__tnode_free(tn);
> > >  	}
> > >  }
> > > 
> > > @@ -1021,18 +1032,25 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
> > >  				      (struct node *)tn, wasfull);
> > > 
> > >  		tp = node_parent((struct node *) tn);
> > > -		tnode_free_flush();
> > >  		if (!tp)
> > >  			break;
> > >  		tn = tp;
> > >  	}
> > > 
> > > +	if (tnode_free_head) {
> > > +		synchronize_rcu();
> > > +		tnode_free_flush();
> > > +	}
> > > +
> > >  	/* Handle last (top) tnode */
> > > -	if (IS_TNODE(tn))
> > > +	if (IS_TNODE(tn)) {
> > >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> > > -
> > > -	rcu_assign_pointer(t->trie, (struct node *)tn);
> > > -	tnode_free_flush();
> > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > +		synchronize_rcu();
> > > +		tnode_free_flush();
> > > +	} else {
> > > +		rcu_assign_pointer(t->trie, (struct node *)tn);
> > > +	}
> > > 
> > >  	return;
> > >  }
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
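
For readers following the quoted take 3 patch: its __tnode_free() helper
picks kfree() or vfree() by comparing the node size against PAGE_SIZE.
Below is a minimal user-space sketch of that arithmetic only. The
4096-byte page and the 8-byte child pointer are assumptions; the 56-byte
tnode size is the figure printed by /proc/net/fib_triestat at the start
of this thread; every sketch_* name is hypothetical and not part of the
kernel.

```c
#include <stddef.h>

/* Illustrative constants only; real values depend on arch and config. */
#define SKETCH_PAGE_SIZE  4096u /* assumed 4 KiB pages */
#define SKETCH_TNODE_SIZE 56u   /* "size of tnode: 56 bytes" (fib_triestat) */
#define SKETCH_PTR_SIZE   8u    /* assumed 64-bit child pointers */

/* Bytes taken by a tnode header plus its 2^bits child pointer slots. */
static size_t sketch_tnode_bytes(unsigned int bits)
{
	return SKETCH_TNODE_SIZE + ((size_t)SKETCH_PTR_SIZE << bits);
}

/* Mirrors the size test in the patch: 1 means vfree(), 0 means kfree(). */
static int sketch_uses_vfree(unsigned int bits)
{
	return sketch_tnode_bytes(bits) > SKETCH_PAGE_SIZE;
}
```

Under these assumptions a node with bits=11 (the "size=11 bits" from the
dmesg warning) comes to 56 + 8 * 2048 = 16440 bytes, so it would take
the vfree() branch, while a node with bits=8 (2104 bytes) would be
released with kfree().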

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 16:15                       ` Jarek Poplawski
@ 2009-06-26 16:23                         ` Paul E. McKenney
  2009-06-26 16:45                           ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 16:23 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > 
> > > > > Jarek Poplawski writes:
> > > > > 
> > > > >  Thanks, 
> > > > > 
> > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > >  
> > > > 
> > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > 
> > > So the idea is to balance memory and latency, so that large changes
> > > (those affecting the root node) get at least one synchronize_rcu(),
> > > while smaller changes just use call_rcu(), correct?  This means that
> > > the amount of memory awaiting an RCU grace period is limited, but
> > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > 
> > > If I understand the goal correctly, looks good!  (Give or take my
> > > limited understanding of fib_trie and its usage, of course.)
> > 
> > The goal is practically to replace all call_rcu() during
> > trie_rebalance() with synchronize_rcu() (except some freeing after
> > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > memory after trie_rebalance() has finished, that's why there were
> > problems with enabled preemption. So this patch tries to do/force
> > this a bit earlier - at least before the top/largest node is
> > rebalanced.
> 
> On the other hand, we could probably stay with call_rcu() plus only
> one synchronize_rcu() before the top node's resize() if you think it's
> enough here?

Well, my first task is to understand the problem/goal.  ;-)

My guess from what you said above is that use of call_rcu(), when
combined with changes to the trie in rapid succession, is resulting
in excessive memory awaiting a grace period.  Is this the case, or am I
confused?

							Thanx, Paul


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 16:23                         ` Paul E. McKenney
@ 2009-06-26 16:45                           ` Jarek Poplawski
  2009-06-26 17:05                             ` Paul E. McKenney
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 16:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > > 
> > > > > > Jarek Poplawski writes:
> > > > > > 
> > > > > >  Thanks, 
> > > > > > 
> > > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > > >  
> > > > > 
> > > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > > 
> > > > So the idea is to balance memory and latency, so that large changes
> > > > (those affecting the root node) get at least one synchronize_rcu(),
> > > > while smaller changes just use call_rcu(), correct?  This means that
> > > > the amount of memory awaiting an RCU grace period is limited, but
> > > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > > 
> > > > If I understand the goal correctly, looks good!  (Give or take my
> > > > limited understanding of fib_trie and its usage, of course.)
> > > 
> > > The goal is practically to replace all call_rcu() during
> > > trie_rebalance() with synchronize_rcu() (except some freeing after
> > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > > memory after trie_rebalance() has finished, that's why there were
> > > problems with enabled preemption. So this patch tries to do/force
> > > this a bit earlier - at least before the top/largest node is
> > > rebalanced.
> > 
> > On the other hand, we could probably stay with call_rcu() plus only
> > one synchronize_rcu() before the top node's resize() if you think it's
> > enough here?
> 
> Well, my first task is to understand the problem/goal.  ;-)
> 
> My guess from what you said above is that use of call_rcu(), when
> combined with changes to the trie in rapid succession, is resulting
> in excessive memory awaiting a grace period.  Is this the case, or am I
> confused?

Exactly! (I guess... ;-)

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 16:45                           ` Jarek Poplawski
@ 2009-06-26 17:05                             ` Paul E. McKenney
  2009-06-26 18:05                               ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 17:05 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 06:45:57PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 09:23:40AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 26, 2009 at 06:15:00PM +0200, Jarek Poplawski wrote:
> > > On Fri, Jun 26, 2009 at 05:54:10PM +0200, Jarek Poplawski wrote:
> > > > On Fri, Jun 26, 2009 at 08:30:10AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Jun 26, 2009 at 05:10:52PM +0200, Jarek Poplawski wrote:
> > > > > > On Fri, Jun 26, 2009 at 03:52:55PM +0200, Robert Olsson wrote:
> > > > > > > 
> > > > > > > Jarek Poplawski writes:
> > > > > > > 
> > > > > > >  Thanks, 
> > > > > > > 
> > > > > > >  Should be worth testing so we synchronize_rcu instead of doing call_rcu's
> > > > > > >  
> > > > > > 
> > > > > > Alas take 2 (nor 1) doesn't compile, so here it is again.
> > > > > 
> > > > > So the idea is to balance memory and latency, so that large changes
> > > > > (those affecting the root node) get at least one synchronize_rcu(),
> > > > > while smaller changes just use call_rcu(), correct?  This means that
> > > > > the amount of memory awaiting an RCU grace period is limited, but
> > > > > the algorithm avoids per-node synchronize_rcu() overhead.
> > > > > 
> > > > > If I understand the goal correctly, looks good!  (Give or take my
> > > > > limited understanding of fib_trie and its usage, of course.)
> > > > 
> > > > The goal is practically to replace all call_rcu() during
> > > > trie_rebalance() with synchronize_rcu() (except some freeing after
> > > > ENOMEM). I guess currently (<= 2.6.30) call_rcu() can free this
> > > > memory after trie_rebalance() has finished, that's why there were
> > > > problems with enabled preemption. So this patch tries to do/force
> > > > this a bit earlier - at least before the top/largest node is
> > > > rebalanced.
> > > 
> > > On the other hand, we could probably stay with call_rcu() plus only
> > > one synchronize_rcu() before the top node's resize() if you think it's
> > > enough here?
> > 
> > Well, my first task is to understand the problem/goal.  ;-)
> > 
> > My guess from what you said above is that use of call_rcu(), when
> > combined with changes to the trie in rapid succession, is resulting
> > in excessive memory awaiting a grace period.  Is this the case, or am I
> > confused?
> 
> Exactly! (I guess... ;-)

;-)

In that case, simply invoking synchronize_rcu() every once in a while
should take care of things.  This could be at the end of every large
trie operation, or you could even count the call_rcu() invocations and
do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
the amount of memory available.

							Thanx, Paul
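
The counting idea above can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions: there are no
real RCU readers here, sketch_wait_for_readers() merely stands in for
synchronize_rcu(), the threshold of 100 is one of the example values
mentioned above, and all sketch_* names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BATCH 100u /* hypothetical threshold; tune to memory */

struct sketch_node {
	struct sketch_node *next_free; /* link in the deferred-free list */
};

static struct sketch_node *sketch_free_head; /* nodes awaiting a wait */
static unsigned int sketch_pending;          /* length of that list */
static unsigned int sketch_grace_waits;      /* number of waits paid */

/* Stand-in for synchronize_rcu(): this model only counts the waits. */
static void sketch_wait_for_readers(void)
{
	sketch_grace_waits++;
}

/* Release everything queued so far; call only after a wait. */
static void sketch_flush(void)
{
	struct sketch_node *n;

	while ((n = sketch_free_head)) {
		sketch_free_head = n->next_free;
		free(n);
	}
	sketch_pending = 0;
}

/* Queue a node; every SKETCH_BATCH-th call pays for one wait and
 * frees the whole batch, which bounds the memory left pending. */
static void sketch_defer_free(struct sketch_node *n)
{
	n->next_free = sketch_free_head;
	sketch_free_head = n;
	if (++sketch_pending >= SKETCH_BATCH) {
		sketch_wait_for_readers();
		sketch_flush();
	}
}
```

Queuing 250 nodes through sketch_defer_free() pays for two waits and
leaves 50 nodes pending, so at most SKETCH_BATCH nodes are ever
outstanding while the cost of a wait is spread over a whole batch.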

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 10:52             ` Eric Dumazet
@ 2009-06-26 17:26               ` Paweł Staszewski
  0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-26 17:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Network Development list


> OK thanks
>
> Please report (while machine has enough load) output of
>
> rtstat -c20 -i1
>
> (rtstat is a symbolic link to lnstat, if not provided by your distro)
>
>
>
>   

Here you go:
rtstat  -i 1 -c 20
rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|rt_cache|
 entries|  in_hit|in_slow_|in_slow_|in_no_ro|  in_brd|in_marti|in_marti| out_hit|out_slow|out_slow|gc_total|gc_ignor|gc_goal_|gc_dst_o|in_hlist|out_hlis|
        |        |     tot|      mc|     ute|        |  an_dst|  an_src|        |    _tot|     _mc|        |      ed|    miss| verflow| _search|t_search|
   93362|22930850| 1671866|       0|    1369|       2|       0|       0|   53432|    1324|       0|       0|       0|       0|       0| 4783896|   11985|
   92067|  101426|    5315|       0|       2|       0|       0|       0|     258|       2|       0|       0|       0|       0|       0|   21893|       6|
   90561|  100094|    4666|       0|       6|       0|       0|       0|     267|       1|       0|       0|       0|       0|       0|   23433|      30|
   90101|   98672|    5630|       0|       2|       0|       0|       0|     253|       0|       0|       0|       0|       0|       0|   24386|      34|
   89994|   99962|    5654|       0|       6|       0|       0|       0|     266|       2|       0|       0|       0|       0|       0|   26251|      38|
   95209|   91974|   14860|       0|       9|       0|       0|       0|     236|      31|       0|       0|       0|       0|       0|   14238|      35|
   95323|  101714|   10126|       0|      14|       0|       0|       0|     255|       9|       0|       0|       0|       0|       0|    8532|      21|
   94814|   99918|    8539|       0|       5|       0|       0|       0|     258|       4|       0|       0|       0|       0|       0|   11069|      24|
   98510|   93929|   12672|       0|      13|       0|       0|       0|     238|      31|       0|       0|       0|       0|       0|   12704|      34|
   98983|   96131|   11128|       0|      12|       0|       0|       0|     252|      10|       0|       0|       0|       0|       0|    7142|      18|
   98824|   99036|    8995|       0|       5|       0|       0|       0|     256|       3|       0|       0|       0|       0|       0|    9343|      16|
   97868|  100032|    7544|       0|       5|       0|       0|       0|     254|       1|       0|       0|       0|       0|       0|   11902|      17|
   96929|  101942|    6722|       0|       7|       0|       0|       0|     263|       3|       0|       0|       0|       0|       0|   13778|      46|
   95932|  100725|    6217|       0|       4|       0|       0|       0|     259|       3|       0|       0|       0|       0|       0|   15175|      47|
   94432|  102074|    5549|       0|       7|       0|       0|       0|     268|       2|       0|       0|       0|       0|       0|   16996|      54|
   92986|  103602|    5187|       0|       3|       0|       0|       0|     260|       0|       0|       0|       0|       0|       0|   18333|      43|
   91387|  103934|    4666|       0|       6|       0|       0|       0|     261|       3|       0|       0|       0|       0|       0|   19316|      46|
   90615|  104916|    5333|       0|       5|       0|       0|       0|     260|       4|       0|       0|       0|       0|       0|   21376|      48|
   89941|  101375|    5189|       0|       8|       0|       0|       0|     270|       0|       0|       0|       0|       0|       0|   22249|      47|
   89744|  101425|    5529|       0|       6|       0|       0|       0|     263|       4|       0|       0|       0|       0|       0|   24089|      57|


CPU load at the same time:
19:24:34     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:35     all    0.00    0.00    0.00    0.00    1.39   20.36    0.00    0.00   78.25
19:24:35       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       1    0.00    0.00    0.00    0.00    5.00   74.00    0.00    0.00   21.00
19:24:35       2    0.00    0.00    0.00    0.00    5.00   73.00    0.00    0.00   22.00
19:24:35       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:35       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:35     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:36     all    0.00    0.00    0.00    0.00    1.21   16.03    0.00    0.00   82.77
19:24:36       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       1    0.00    0.00    0.00    0.00    5.05   75.76    0.00    0.00   19.19
19:24:36       2    0.00    0.00    0.99    0.00    5.94   69.31    0.00    0.00   23.76
19:24:36       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:36       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:36     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:37     all    0.00    0.00    0.14    0.00    1.64   20.19    0.00    0.00   78.04
19:24:37       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       1    0.00    0.00    0.00    0.00    5.94   73.27    0.00    0.00   20.79
19:24:37       2    0.00    0.00    0.00    0.00    7.00   75.00    0.00    0.00   18.00
19:24:37       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:37       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:37     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:38     all    0.00    0.00    0.00    0.00    0.90   14.24    0.00    0.00   84.86
19:24:38       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       1    0.00    0.00    0.00    0.00    4.00   73.00    0.00    0.00   23.00
19:24:38       2    0.00    0.00    0.00    0.00    5.05   69.70    0.00    0.00   25.25
19:24:38       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:38       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:38     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:39     all    0.00    0.00    0.00    0.00    2.43   29.80    0.00    0.00   67.77
19:24:39       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       1    0.00    0.00    0.00    0.00    5.00   67.00    0.00    0.00   28.00
19:24:39       2    0.00    0.00    0.00    0.00    5.94   67.33    0.00    0.00   26.73
19:24:39       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:39       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:39     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:40     all    0.00    0.00    0.00    0.00    1.43   14.15    0.00    0.00   84.42
19:24:40       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       1    0.00    0.00    0.00    0.00    5.94   68.32    0.00    0.00   25.74
19:24:40       2    0.00    0.00    0.00    0.00    7.07   71.72    0.00    0.00   21.21
19:24:40       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:40       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

19:24:40     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
19:24:41     all    0.00    0.00    0.00    0.00    1.40   17.68    0.00    0.00   80.92
19:24:41       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:41       1    0.00    0.00    0.00    0.00    6.06   70.71    0.00    0.00   23.23
19:24:41       2    0.00    0.00    0.00    0.00    5.88   67.65    0.00    0.00   26.47
19:24:41       3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:41       4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:41       5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:41       6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
19:24:41       7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:         42          0          0          1          0          1          0          0   IO-APIC-edge      timer
  1:          0          0          0          1          0          0          0          1   IO-APIC-edge      i8042
  9:          0          0          0          0          0          0          0          0   IO-APIC-fasteoi   acpi
 14:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide0
 15:          0          0          0          0          0          0          0          0   IO-APIC-edge      ide1
 29:    1139988    4692793      89662          3          0          1          0          3   PCI-MSI-edge      eth0
 30:          0          2    6207546          1          0          3          0          0   PCI-MSI-edge      eth1
 31:          0          1          1          0          0          0          0          0   PCI-MSI-edge
 32:          0          0          0          0          0          0          2          0   PCI-MSI-edge
 33:          1          1          0          0          0          0          0          0   PCI-MSI-edge
 34:          0          0          0          1          0          1          0          0   PCI-MSI-edge
 35:          0          0          0          1          0          0          0          1   PCI-MSI-edge
 36:          0          0          0          0          1          0          0          1   PCI-MSI-edge
 37:          1          0          0          0          0          1          0          0   PCI-MSI-edge
 38:          0          0          1          0          1          0          0          0   PCI-MSI-edge
 39:          0          0          2          0          0          0          0          0   PCI-MSI-edge
 40:          0          0          0          0          0          0          2          0   PCI-MSI-edge
 41:          0          2          0          0          0          0          0          0   PCI-MSI-edge
 42:          0          0          0          0          0          2          0          0   PCI-MSI-edge
 43:          0          0          0          2          0          0          0          0   PCI-MSI-edge
 44:          0          0          0          0          0          0          0          2   PCI-MSI-edge
 45:          2          0          0          0          0          0          0          0   PCI-MSI-edge
 46:          0          0          0          0          2          0          0          0   PCI-MSI-edge
 48:        191        200        185        213        219        219        227        214   PCI-MSI-edge      ahci
 49:          0          1          1          0          0          2          1          0   PCI-MSI-edge      ioat-msi
NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
LOC:    1083019    6233788    7735401      12394      15178      10718      21192       8515   Local timer interrupts
RES:        921         44         33         20         13          8         10         12   Rescheduling interrupts
CAL:         20         85         88         87         90         90         91         86   Function call interrupts
TLB:        103        114        918        929         95        113        973        990   TLB shootdowns
SPU:          0          0          0          0          0          0          0          0   Spurious interrupts
ERR:          0
MIS:          0

I use smp_affinity:
eth0 is on cpu1
eth1 is on cpu2

All tests were run on kernel 2.6.29.5.

There is only one "Fix inflate_threshold" message in dmesg:
Fix inflate_threshold_root. Now=15 size=11 bits

This message appears when the bgpd process starts to learn routes from peers.




^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 17:05                             ` Paul E. McKenney
@ 2009-06-26 18:05                               ` Jarek Poplawski
  2009-06-26 18:21                                 ` Paul E. McKenney
  2009-06-26 20:26                                 ` Robert Olsson
  0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 18:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
...
> In that case, simply invoking synchronize_rcu() every once in a while
> should take care of things.  This could be at the end of every large
> trie operation, or you could even count the call_rcu() invocations and
> do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
> the amount of memory available.

OK, for now here is the minimal change for testing (2.6.30 needs the
two previously mentioned commits from 2.6.31-rc). (I guess I'll send it
with a changelog once net-next opens.)

Thanks,
Jarek P.
--- (take 4 - for testing)

 net/ipv4/fib_trie.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..98b31a1 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
 	t_key cindex, key;
-	struct tnode *tp;
+	struct tnode *tp, *oldtnode = tn;
 
 	key = tn->key;
 
@@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		if (oldtnode != tn)
+			synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();
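
The condition added in take 4 can be illustrated with a simplified
user-space model. This is a sketch only: it assumes nodes are linked by
a plain parent pointer and ignores the fact that resize() in the real
code may hand back a replacement node; struct sketch_tnode and
sketch_needs_sync() are hypothetical names, not kernel symbols.

```c
#include <stddef.h>

/* Minimal stand-in for a trie node: only the parent link matters here. */
struct sketch_tnode {
	struct sketch_tnode *parent; /* NULL for the top node */
};

/* Walk from the starting node up to the top node. If at least one step
 * up was taken, lower levels were rebalanced first and may have queued
 * memory, so the model asks for one wait before the top node is resized.
 * Returns 1 when a wait is wanted, 0 otherwise. */
static int sketch_needs_sync(struct sketch_tnode *start)
{
	struct sketch_tnode *tn = start;

	while (tn->parent)
		tn = tn->parent;

	return tn != start;
}
```

In this model a rebalance that starts at the top node itself never
waits, which matches the intent of the oldtnode != tn test: pay for a
grace period only when earlier levels had something to free.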

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 18:05                               ` Jarek Poplawski
@ 2009-06-26 18:21                                 ` Paul E. McKenney
  2009-06-26 20:19                                   ` Jarek Poplawski
  2009-06-26 20:26                                 ` Robert Olsson
  1 sibling, 1 reply; 99+ messages in thread
From: Paul E. McKenney @ 2009-06-26 18:21 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 08:05:45PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
> ...
> > In that case, simply invoking synchronize_rcu() every once in a while
> > should take care of things.  This could be at the end of every large
> > trie operation, or you could even count the call_rcu() invocations and
> > do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
> > the amount of memory available.
> 
> OK, for now the minimal change for testing (2.6.30 needs previously
> mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
> changelog after net-next is opened.)

Looks promising to me!!!

							Thanx, Paul

> Thanks,
> Jarek P.
> --- (take 4 - for testing)
> 
>  net/ipv4/fib_trie.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..98b31a1 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  {
>  	int wasfull;
>  	t_key cindex, key;
> -	struct tnode *tp;
> +	struct tnode *tp, *oldtnode = tn;
> 
>  	key = tn->key;
> 
> @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  	}
> 
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
> +		/* force memory freeing after last changes */
> +		if (oldtnode != tn)
> +			synchronize_rcu();
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> +	}
> 
>  	rcu_assign_pointer(t->trie, (struct node *)tn);
>  	tnode_free_flush();

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 18:21                                 ` Paul E. McKenney
@ 2009-06-26 20:19                                   ` Jarek Poplawski
  0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 20:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Robert Olsson, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 11:21:43AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 26, 2009 at 08:05:45PM +0200, Jarek Poplawski wrote:
> > On Fri, Jun 26, 2009 at 10:05:38AM -0700, Paul E. McKenney wrote:
> > ...
> > > In that case, simply invoking synchronize_rcu() every once in a while
> > > should take care of things.  This could be at the end of every large
> > > trie operation, or you could even count the call_rcu() invocations and
> > > do a synchronize_rcu() every 100th, 1,000th, or whatever, based on
> > > the amount of memory available.
> > 
> > OK, for now the minimal change for testing (2.6.30 needs previously
> > mentioned two commits from 2.6.31-rc). (I guess I'll send it with a
> > changelog after net-next is opened.)
> 
> Looks promising to me!!!
> 

Alas, after rethinking, there is one detail which bothers me. The
largest allocations here are done with vmalloc() and freed from an RCU
callback via schedule_work(). So I wonder whether, for this reason
alone, the previous version, which frees directly, isn't more reliable
anyway. Of course, it's my fault I didn't point this out while
describing the problem earlier. (I knew I missed something...;-)
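
To show which tnodes take the deferred vfree path, here is a small userspace
sketch of the size check. The constants are assumptions for illustration:
4 KiB pages, 8-byte pointers, and the 56-byte tnode header that
/proc/net/fib_triestat reported earlier in this thread. The helper names are
made up.

```c
#include <assert.h>

/* Assumed values, not read from a running kernel. */
#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_TNODE_HDR 56ULL
#define SKETCH_PTR_SIZE  8ULL

enum free_path { FREE_DIRECT_KFREE, FREE_DEFERRED_VFREE };

/* Header plus one child pointer per slot, 2^bits slots in total. */
static unsigned long long tnode_bytes(unsigned int bits)
{
	assert(bits < 32);
	return SKETCH_TNODE_HDR + (SKETCH_PTR_SIZE << bits);
}

/* Up to one page is kmalloc'ed; anything larger comes from vmalloc. */
static enum free_path tnode_free_path(unsigned int bits)
{
	return tnode_bytes(bits) <= SKETCH_PAGE_SIZE ?
		FREE_DIRECT_KFREE : FREE_DEFERRED_VFREE;
}
```

Under these assumptions every tnode of 9 bits or more is vmalloc'ed, and an
18-bit root like the one in the reported triestat takes about 2 MB, so a
grace period alone does not mean that memory is already back.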

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 18:05                               ` Jarek Poplawski
  2009-06-26 18:21                                 ` Paul E. McKenney
@ 2009-06-26 20:26                                 ` Robert Olsson
  2009-06-26 20:37                                   ` Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-26 20:26 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list



Yes, looks like a good solution, but maybe it is safest to synchronize unconditionally?


diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..9cb8623 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 	}
 
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
+		/* force memory freeing after last changes */
+		synchronize_rcu();
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+	}
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();

Cheers
						--ro

Jarek Poplawski writes:

 >  net/ipv4/fib_trie.c |    8 ++++++--
 >  1 files changed, 6 insertions(+), 2 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..98b31a1 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  {
 >  	int wasfull;
 >  	t_key cindex, key;
 > -	struct tnode *tp;
 > +	struct tnode *tp, *oldtnode = tn;
 >  
 >  	key = tn->key;
 >  
 > @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	/* Handle last (top) tnode */
 > -	if (IS_TNODE(tn))
 > +	if (IS_TNODE(tn)) {
 > +		/* force memory freeing after last changes */
 > +		if (oldtnode != tn)
 > +			synchronize_rcu();
 >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 > +	}
 >  
 >  	rcu_assign_pointer(t->trie, (struct node *)tn);
 >  	tnode_free_flush();
 > --
 > To unsubscribe from this list: send the line "unsubscribe netdev" in
 > the body of a message to majordomo@vger.kernel.org
 > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 20:26                                 ` Robert Olsson
@ 2009-06-26 20:37                                   ` Jarek Poplawski
  2009-06-26 21:20                                     ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 20:37 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 10:26:53PM +0200, Robert Olsson wrote:
> 
> 
> Yes, looks like a good solution, but maybe it is safest to synchronize unconditionally?

Hmm... I lost around half an hour on this doubt... Sure! (Unless
there are some strange cases which create and destroy very small
tables very often?)

Thanks,
Jarek P.

> 
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..9cb8623 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1028,8 +1028,11 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  	}
>  
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
> +		/* force memory freeing after last changes */
> +		synchronize_rcu();
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> +	}
>  
>  	rcu_assign_pointer(t->trie, (struct node *)tn);
>  	tnode_free_flush();
> 
> Cheers
> 						--ro
> 
> Jarek Poplawski writes:
> 
>  >  net/ipv4/fib_trie.c |    8 ++++++--
>  >  1 files changed, 6 insertions(+), 2 deletions(-)
>  > 
>  > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>  > index 012cf5a..98b31a1 100644
>  > --- a/net/ipv4/fib_trie.c
>  > +++ b/net/ipv4/fib_trie.c
>  > @@ -1008,7 +1008,7 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  >  {
>  >  	int wasfull;
>  >  	t_key cindex, key;
>  > -	struct tnode *tp;
>  > +	struct tnode *tp, *oldtnode = tn;
>  >  
>  >  	key = tn->key;
>  >  
>  > @@ -1028,8 +1028,12 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  >  	}
>  >  
>  >  	/* Handle last (top) tnode */
>  > -	if (IS_TNODE(tn))
>  > +	if (IS_TNODE(tn)) {
>  > +		/* force memory freeing after last changes */
>  > +		if (oldtnode != tn)
>  > +			synchronize_rcu();
>  >  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
>  > +	}
>  >  
>  >  	rcu_assign_pointer(t->trie, (struct node *)tn);
>  >  	tnode_free_flush();

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26 20:37                                   ` Jarek Poplawski
@ 2009-06-26 21:20                                     ` Jarek Poplawski
  0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-26 21:20 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paul E. McKenney, Robert Olsson, Eric Dumazet,
	Paweł Staszewski,
	Robert Olsson, Linux Network Development list

On Fri, Jun 26, 2009 at 10:37:13PM +0200, Jarek Poplawski wrote:
> On Fri, Jun 26, 2009 at 10:26:53PM +0200, Robert Olsson wrote:
> > 
> > 
> > Yes, looks like a good solution, but maybe it is safest to synchronize unconditionally?
> 
> Hmm... I lost around half an hour for this doubt... Sure! (Unless
> there are some strange cases which very often create and destroy very
> small tables?)

...or maybe even only updating such small tables very often?

Btw., Robert, I wondered about some design details lately, especially
about the pointer to a parent. I didn't see it in the basic docs, so it
seems it could be avoided. It seems to be a problem with RCU, unless I'm
missing something: if there were no going back from children to parents,
it seems we could free those "temporary" tnodes (created by inflate()
and halve() and destroyed before resize() has finished) earlier.

Another problem with this, it seems, is possibly false lookups (if we
go back to the new parent which doesn't have its parent or other nodes
updated). So, was the performance gain large enough to justify this?

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-26  9:19     ` Robert Olsson
  2009-06-26  9:37       ` Jarek Poplawski
@ 2009-06-27 19:20       ` Jarek Poplawski
  2009-06-27 20:51         ` Jarek Poplawski
  2009-06-28 11:04         ` Robert Olsson
  1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-27 19:20 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paweł Staszewski, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Robert Olsson wrote, On 06/26/2009 11:19 AM:

> Jarek Poplawski writes:
> 
>  > >> oprofile: using NMI interrupt.
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
>  > >> Fix inflate_threshold_root. Now=15 size=11 bits
> 
>  > On the other hand, even if there is no problem with memory, it seems
>  > because of hitting max_resize the threshold should be changed, e.g.
>  > by reverting the patch below.
> 
>  You seem to have some temporary memory problem, so the printout might be
>  a bit misleading in this case. We really like to keep the root node as big
>  as we can, to keep the tree as flat as possible for performance reasons.
>  (We're even more motivated now that we can disable the route cache.)
> 
>  So I guess the next insert/delete inflates the root node to be within
>  the interval. So I'll assume this is just a temporary failure?
> 
>  It would be nice to have the *thresholds* settable via /proc or /sys. I
>  would use this in the other direction, to trade memory for even faster
>  lookups.
>  
>  But maybe the memory allocation experts have some good suggestions.

Robert, you and Eric pointed at memory problems, so I thought I had
missed something. But on a second look I see that "skipped node resize"
should show this, and it's always zero in these reports. So, isn't it
possible that the current inflate_threshold_root is simply unreachable
under some conditions, at least within 10 loops?

Then these settable thresholds might be more useful here than memory
fixes, but here is an idea to try to handle this automatically within
some limits. The patch below increases inflate_threshold_root (only)
by up to ~50% of its initial value if needed, and should be able to go
back sometimes.

Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
some offsets.)

Thanks,
Jarek P.
---

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize < 5 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-27 19:20       ` Jarek Poplawski
@ 2009-06-27 20:51         ` Jarek Poplawski
  2009-06-28  0:28           ` Paweł Staszewski
  2009-06-28 11:11           ` Robert Olsson
  2009-06-28 11:04         ` Robert Olsson
  1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-27 20:51 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paweł Staszewski, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
...
> Then these settable thresholds might be more useful here than memory
> fixes, but here is some idea to try handle this automatically within
> some limits. The patch below increases inflate_threshold_root (only)
> up to ~50% of its initial value if needed, and should be able to go
> back sometimes.
> 
> Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
> some offsets.)

A tiny adjustment in the last if...
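
The adjustment changes the back-off condition from "max_resize < 5" to
"max_resize > 4", so the fix is taken back only when the inflate loop ended
with more than four attempts left. A userspace sketch of the resulting
logic, assuming it is called once per root resize() with the final value of
max_resize (adjust_root_fix() and the warnings counter are made up for
illustration; in the patch this code is inline and calls pr_warning()):

```c
#include <assert.h>

static const int inflate_threshold_root = 15;
static int inflate_threshold_root_fix;	/* global, as in the patch */
static int warnings;			/* counts would-be pr_warning() calls */

/* Threshold that resize() would use for the root node. */
static int root_threshold_in_use(void)
{
	return inflate_threshold_root + inflate_threshold_root_fix;
}

/* max_resize < 0 means the inflate loop ran out of attempts. */
static void adjust_root_fix(int max_resize)
{
	if (max_resize < 0) {
		if (inflate_threshold_root_fix * 2 < inflate_threshold_root)
			inflate_threshold_root_fix++;	/* relax the threshold */
		else
			warnings++;			/* limit reached: warn */
	} else if (max_resize > 4 && inflate_threshold_root_fix) {
		inflate_threshold_root_fix--;		/* converged quickly */
	}
	assert(inflate_threshold_root_fix >= 0);
}
```

With the initial value of 15 the fix can grow to 8, so the threshold in use
tops out at 23, and only after that does the warning appear again.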

Jarek P.
--- (take 2)

 net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 1 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..1dc1bb4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-27 20:51         ` Jarek Poplawski
@ 2009-06-28  0:28           ` Paweł Staszewski
  2009-06-28 11:11           ` Robert Olsson
  1 sibling, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28  0:28 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Thanks Jarek

I applied this patch to 2.6.29.5.
For results we must wait for the "rush hours", when there will be more
traffic / routes :)



Regards
Paweł Staszewski


Jarek Poplawski writes:
> On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
> ...
>   
>> Then these settable thresholds might be more useful here than memory
>> fixes, but here is some idea to try handle this automatically within
>> some limits. The patch below increases inflate_threshold_root (only)
>> up to ~50% of its initial value if needed, and should be able to go
>> back sometimes.
>>
>> Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
>> some offsets.)
>>     
>
> A tiny adjustment in the last if...
>
> Jarek P.
> --- (take 2)
>
>  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
>  1 files changed, 16 insertions(+), 7 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..1dc1bb4 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
>  static const int halve_threshold_root = 8;
>  static const int inflate_threshold_root = 15;
> +static int inflate_threshold_root_fix;
>  
>  
>  static void __alias_free_mem(struct rcu_head *head)
> @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	/* Keep root node larger  */
>  
>  	if (!tn->parent)
> -		inflate_threshold_use = inflate_threshold_root;
> +		inflate_threshold_use = inflate_threshold_root +
> +					inflate_threshold_root_fix;
>  	else
>  		inflate_threshold_use = inflate_threshold;
>  
> @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  	}
>  
>  	if (max_resize < 0) {
> -		if (!tn->parent)
> -			pr_warning("Fix inflate_threshold_root."
> -				   " Now=%d size=%d bits\n",
> -				   inflate_threshold_root, tn->bits);
> -		else
> +		if (!tn->parent) {
> +			if (inflate_threshold_root_fix * 2 <
> +			    inflate_threshold_root)
> +				inflate_threshold_root_fix++;
> +			else
> +				pr_warning("Fix inflate_threshold_root."
> +					   " Now=%d size=%d bits fix=%d\n",
> +					   inflate_threshold_root, tn->bits,
> +					   inflate_threshold_root_fix);
> +		} else {
>  			pr_warning("Fix inflate_threshold."
>  				   " Now=%d size=%d bits\n",
>  				   inflate_threshold, tn->bits);
> -	}
> +		}
> +	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
> +		inflate_threshold_root_fix--;
>  
>  	check_tnode(tn);
>  


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-27 19:20       ` Jarek Poplawski
  2009-06-27 20:51         ` Jarek Poplawski
@ 2009-06-28 11:04         ` Robert Olsson
  2009-06-28 12:03           ` Jarek Poplawski
  2009-06-28 14:35           ` Jarek Poplawski
  1 sibling, 2 replies; 99+ messages in thread
From: Robert Olsson @ 2009-06-28 11:04 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Paweł Staszewski,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list


Jarek Poplawski writes:

 > Robert, you and Eric pointed at memory problems, so I thought I missed
 > something. But after the second look I see "skipped node resize" should
 > show this, but it's always zero on these reports. So, isn't it possible
 > the current inflate_threshold_root is simply unreachable with some
 > conditions, at least within 10 loops?
 > 
 > Then these settable thresholds might be more useful here than memory
 > fixes, but here is some idea to try handle this automatically within
 > some limits. The patch below increases inflate_threshold_root (only)
 > up to ~50% of its initial value if needed, and should be able to go
 > back sometimes.

 Yes, we keep the old tnode size, and the convergence interval was one
 of the concerns. That's why these checks were added. Still, we want to
 inflate the root node to the very max.

 So this approach of halving or doubling tnodes towards the root
 node was suggested by the Dyntree paper. I asked Stefan (one of the
 authors) if we could get safe and very offensive settings. But
 from what I understood there was no easy way to calculate this.
 So any bright ideas in this area are very welcome. But we should
 also monitor the size of the root and the average tree depth so we
 don't take a too defensive approach just to solve this case.

 The memory patches and "manual RCU" are also interesting for addressing
 the case with PREEMPT's.

 Inserts and deletes are also very fast due to the flat tree, so I think
 we can "slow down" this a bit if needed to be safe with all PREEMPT's.

 Thanks for giving this area energy.

 Cheers
					--ro


 > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
 > some offsets.)
 > 
 > Thanks,
 > Jarek P.
 > ---
 > 
 >  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 >  1 files changed, 16 insertions(+), 7 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..1dc1bb4 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 >  static const int inflate_threshold = 50;
 >  static const int halve_threshold_root = 8;
 >  static const int inflate_threshold_root = 15;
 > +static int inflate_threshold_root_fix;
 >  
 >  
 >  static void __alias_free_mem(struct rcu_head *head)
 > @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	/* Keep root node larger  */
 >  
 >  	if (!tn->parent)
 > -		inflate_threshold_use = inflate_threshold_root;
 > +		inflate_threshold_use = inflate_threshold_root +
 > +					inflate_threshold_root_fix;
 >  	else
 >  		inflate_threshold_use = inflate_threshold;
 >  
 > @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	if (max_resize < 0) {
 > -		if (!tn->parent)
 > -			pr_warning("Fix inflate_threshold_root."
 > -				   " Now=%d size=%d bits\n",
 > -				   inflate_threshold_root, tn->bits);
 > -		else
 > +		if (!tn->parent) {
 > +			if (inflate_threshold_root_fix * 2 <
 > +			    inflate_threshold_root)
 > +				inflate_threshold_root_fix++;
 > +			else
 > +				pr_warning("Fix inflate_threshold_root."
 > +					   " Now=%d size=%d bits fix=%d\n",
 > +					   inflate_threshold_root, tn->bits,
 > +					   inflate_threshold_root_fix);
 > +		} else {
 >  			pr_warning("Fix inflate_threshold."
 >  				   " Now=%d size=%d bits\n",
 >  				   inflate_threshold, tn->bits);
 > -	}
 > +		}
 > +	} else if (max_resize < 5 && !tn->parent && inflate_threshold_root_fix)
 > +		inflate_threshold_root_fix--;
 >  
 >  	check_tnode(tn);
 >  

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-27 20:51         ` Jarek Poplawski
  2009-06-28  0:28           ` Paweł Staszewski
@ 2009-06-28 11:11           ` Robert Olsson
  2009-06-29  7:57             ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-06-28 11:11 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Paweł Staszewski,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list


When testing, please monitor the size of the root node and the average depth.
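
For the depth part, a minimal userspace sketch that pulls the first
"Aver depth" and "Max depth" values out of text in the
/proc/net/fib_triestat layout shown at the start of this thread.
parse_triestat() and its struct are made-up names, and the root node size is
not handled here: it would have to be read from the "Internal nodes"
histogram line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct triestat_sketch {
	double aver_depth;
	int max_depth;
	int found;	/* number of fields successfully parsed */
};

/* Parse the first "Aver depth:" and "Max depth:" entries found in text. */
static struct triestat_sketch parse_triestat(const char *text)
{
	struct triestat_sketch s = { 0.0, 0, 0 };
	const char *p;

	assert(text != NULL);
	p = strstr(text, "Aver depth:");
	if (p && sscanf(p, "Aver depth: %lf", &s.aver_depth) == 1)
		s.found++;
	p = strstr(text, "Max depth:");
	if (p && sscanf(p, "Max depth: %d", &s.max_depth) == 1)
		s.found++;
	return s;
}
```

Sampling this before and after a patched kernel has seen the busy hours
would show whether the tree got deeper.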

Cheers
				--ro
       	       	       	    	 

Jarek Poplawski writes:
 > On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
 > ...
 > > Then these settable thresholds might be more useful here than memory
 > > fixes, but here is some idea to try handle this automatically within
 > > some limits. The patch below increases inflate_threshold_root (only)
 > > up to ~50% of its initial value if needed, and should be able to go
 > > back sometimes.
 > > 
 > > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
 > > some offsets.)
 > 
 > A tiny adjustment in the last if...
 > 
 > Jarek P.
 > --- (take 2)
 > 
 >  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
 >  1 files changed, 16 insertions(+), 7 deletions(-)
 > 
 > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > index 012cf5a..1dc1bb4 100644
 > --- a/net/ipv4/fib_trie.c
 > +++ b/net/ipv4/fib_trie.c
 > @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
 >  static const int inflate_threshold = 50;
 >  static const int halve_threshold_root = 8;
 >  static const int inflate_threshold_root = 15;
 > +static int inflate_threshold_root_fix;
 >  
 >  
 >  static void __alias_free_mem(struct rcu_head *head)
 > @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	/* Keep root node larger  */
 >  
 >  	if (!tn->parent)
 > -		inflate_threshold_use = inflate_threshold_root;
 > +		inflate_threshold_use = inflate_threshold_root +
 > +					inflate_threshold_root_fix;
 >  	else
 >  		inflate_threshold_use = inflate_threshold;
 >  
 > @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
 >  	}
 >  
 >  	if (max_resize < 0) {
 > -		if (!tn->parent)
 > -			pr_warning("Fix inflate_threshold_root."
 > -				   " Now=%d size=%d bits\n",
 > -				   inflate_threshold_root, tn->bits);
 > -		else
 > +		if (!tn->parent) {
 > +			if (inflate_threshold_root_fix * 2 <
 > +			    inflate_threshold_root)
 > +				inflate_threshold_root_fix++;
 > +			else
 > +				pr_warning("Fix inflate_threshold_root."
 > +					   " Now=%d size=%d bits fix=%d\n",
 > +					   inflate_threshold_root, tn->bits,
 > +					   inflate_threshold_root_fix);
 > +		} else {
 >  			pr_warning("Fix inflate_threshold."
 >  				   " Now=%d size=%d bits\n",
 >  				   inflate_threshold, tn->bits);
 > -	}
 > +		}
 > +	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
 > +		inflate_threshold_root_fix--;
 >  
 >  	check_tnode(tn);
 >  

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 11:04         ` Robert Olsson
@ 2009-06-28 12:03           ` Jarek Poplawski
  2009-06-28 14:35           ` Jarek Poplawski
  1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 12:03 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Robert Olsson, Paweł Staszewski,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
...
>  Yes, we keep the old tnode size, and the convergence interval was one
>  of the concerns. That's why these checks were added. Still, we want to
>  inflate the root node to the very max.
> 
>  So this approach of halving or doubling tnodes towards the root
>  node was suggested by the Dyntree paper. I asked Stefan (one of the
>  authors) if we could get safe and very offensive settings. But
>  from what I understood there was no easy way to calculate this.
>  So any bright ideas in this area are very welcome. But we should
>  also monitor the size of the root and the average tree depth so we
>  don't take a too defensive approach just to solve this case.

Yes, but with this offensive approach it seems the current level of
warnings could be too alarming. Btw., because of a design flaw in my
current patch, this _fix variable, which logically should be per trie/
table, can now be changed by updates to other tables, but I think it
could all be fine-tuned more in the future. Of course, only if there are
people interested in testing/reporting this more.
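
One way the per-table variant could look, as a userspace sketch only:
struct trie_sketch is a cut-down stand-in for struct trie, and the field and
helper names are hypothetical, not part of any posted patch.

```c
#include <assert.h>
#include <stddef.h>

#define INFLATE_THRESHOLD_ROOT 15

/* Only the hypothetical new field is shown. */
struct trie_sketch {
	int inflate_threshold_root_fix;
};

/* Root threshold for this table only. */
static int trie_root_threshold(const struct trie_sketch *t)
{
	return INFLATE_THRESHOLD_ROOT + t->inflate_threshold_root_fix;
}

/* Same rules as the global version, but the state lives in the table. */
static void trie_adjust_root_fix(struct trie_sketch *t, int max_resize)
{
	assert(t != NULL);
	if (max_resize < 0) {
		if (t->inflate_threshold_root_fix * 2 < INFLATE_THRESHOLD_ROOT)
			t->inflate_threshold_root_fix++;
	} else if (max_resize > 4 && t->inflate_threshold_root_fix) {
		t->inflate_threshold_root_fix--;
	}
}
```

With the state kept per table, quick resizes of a small table such as Local
no longer take back the fix that Main has accumulated.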

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 11:04         ` Robert Olsson
  2009-06-28 12:03           ` Jarek Poplawski
@ 2009-06-28 14:35           ` Jarek Poplawski
  2009-06-28 15:32             ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 14:35 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Robert Olsson, Paweł Staszewski,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Sun, Jun 28, 2009 at 01:04:51PM +0200, Robert Olsson wrote:
...
>  The memory patches and "manual RCU" are also interesting for addressing
>  the case with PREEMPT's.

Since 2.6.29 seems to be preferred here, and there were a lot of takes
in this thread, I attach below a combined all-in-one patch including:
- 2.6.29 -> 2.6.30 preemption disable patch
- 2 RCU vs. preemption fixes from 2.6.31-rc
- "manual RCU" patch to force vfree/kfree before root's resize (take 3)
- "automatic" inflate_threshold_root fix (take 2)

Thanks,
Jarek P.

--- (for 2.6.29.x or even .28 or .27; any testing appreciated)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 15:45:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -315,6 +318,7 @@ static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
 static const int halve_threshold_root = 8;
 static const int inflate_threshold_root = 15;
+static int inflate_threshold_root_fix;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -363,6 +367,17 @@ static void __tnode_vfree(struct work_st
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -385,6 +400,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		__tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +528,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +542,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -581,7 +614,8 @@ static struct node *resize(struct trie *
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -605,15 +639,22 @@ static struct node *resize(struct trie *
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			if (inflate_threshold_root_fix * 2 <
+			    inflate_threshold_root)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
 
@@ -670,7 +711,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +797,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +842,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +926,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1024,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -1003,11 +1046,22 @@ static struct node *trie_rebalance(struc
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
-	return (struct node *)tn;
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1209,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1629,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 14:35           ` Jarek Poplawski
@ 2009-06-28 15:32             ` Paweł Staszewski
  2009-06-28 15:48               ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28 15:32 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list



18 hours after applying Jarek's first patch, I have seen no "Fix
inflate_threshold_root" messages, even when I run "clear ip bgp *" on the
router. So I am now switching from the previous patch to this new one for
testing, and we will see ...

Regards
Pawel Staszewski



* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 15:32             ` Paweł Staszewski
@ 2009-06-28 15:48               ` Paweł Staszewski
  2009-06-28 19:56                 ` Jarek Poplawski
  2009-06-28 21:36                 ` Jarek Poplawski
  0 siblings, 2 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-28 15:48 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list


After applying this patch, something is wrong.

Traffic is not forwarded,
there is no info in dmesg and no info from bgp,
and I also can't connect to the bgpd process.

I have reverted the kernel to the previous version with Jarek's first patch.




* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 15:48               ` Paweł Staszewski
@ 2009-06-28 19:56                 ` Jarek Poplawski
  2009-06-28 21:36                 ` Jarek Poplawski
  1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 19:56 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>
> After applying this patch, something is wrong.
>
> Traffic is not forwarded,
> there is no info in dmesg and no info from bgp,
> and I also can't connect to the bgpd process.
>
> I have reverted the kernel to the previous version with Jarek's first patch.
>

Thank you very much, Pawel, for trying this. I'm starting to look for
the reason. In the meantime try to get some fib_trie stats for Robert.

Jarek P.

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 15:48               ` Paweł Staszewski
  2009-06-28 19:56                 ` Jarek Poplawski
@ 2009-06-28 21:36                 ` Jarek Poplawski
  2009-06-29  8:08                   ` Paweł Staszewski
  2009-06-29  8:33                   ` [PATCH net-2.6] " Jarek Poplawski
  1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-28 21:36 UTC (permalink / raw)
  To: Paweł Staszewski, David Miller
  Cc: Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

To David Miller:
since the patches that Pawel tested with negative results include the
2 current fixes from 2.6.31-rc, I hope they haven't been sent to -stable
yet. If they have, please withdraw them until they are tested on their
own. Thanks.

To Pawel:
On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>
> After applying this patch, something is wrong.
>
> Traffic is not forwarded,
> there is no info in dmesg and no info from bgp,
> and I also can't connect to the bgpd process.
>
> I have reverted the kernel to the previous version with Jarek's first patch.
>

Since checking this can take time, I attach here a patch with only the
changes which are currently in 2.6.31-rc. Of course, this part could be
broken as well, so it's up to you: if you could try it with caution
somewhere, it would be very helpful; otherwise don't bother.

It can be applied to 2.6.29 with or without the currently working
patch.

Thanks,
Jarek P.
--- (for 2.6.29.x, .28 or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-06-28 23:06:02.000000000 +0200
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -998,6 +1021,7 @@ static struct node *trie_rebalance(struc
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		tnode_free_flush();
 		if (!tp)
 			break;
 		tn = tp;
@@ -1007,7 +1031,10 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1182,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1602,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 



* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 11:11           ` Robert Olsson
@ 2009-06-29  7:57             ` Paweł Staszewski
  0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29  7:57 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Jarek Poplawski, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Robert Olsson wrote:
> When testing, please monitor the size of the root node and the average depth
>
> Cheers
> 				--ro
>   
Some fib_triestat output from kernel 2.6.29.5 with Jarek's first patch.

Dumped every 10 seconds:

Mon Jun 29 11:54:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276978
        Prefixes:       290448
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691606
Null ptrs: 347816
Total size: 18403  kB

Counters:
---------
gets = 390981859
backtracks = 5332465
semantic match passed = 390452936
semantic match miss = 30198
null node hit= 375522207
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391017445
backtracks = 121012874
semantic match passed = 37565
semantic match miss = 0
null node hit= 261583
skipped node resize = 0

Mon Jun 29 11:54:41 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276976
        Prefixes:       290446
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691606
Null ptrs: 347818
Total size: 18403  kB

Counters:
---------
gets = 391061852
backtracks = 5334173
semantic match passed = 390532664
semantic match miss = 30199
null node hit= 375595706
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391097445
backtracks = 121039213
semantic match passed = 37570
semantic match miss = 0
null node hit= 261589
skipped node resize = 0

Mon Jun 29 11:54:51 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276978
        Prefixes:       290448
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691606
Null ptrs: 347816
Total size: 18403  kB

Counters:
---------
gets = 391177325
backtracks = 5336127
semantic match passed = 390647917
semantic match miss = 30208
null node hit= 375699713
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391212932
backtracks = 121075919
semantic match passed = 37586
semantic match miss = 0
null node hit= 261701
skipped node resize = 0

Mon Jun 29 11:55:01 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276978
        Prefixes:       290448
        Internal nodes: 66813
          1: 34703  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691606
Null ptrs: 347816
Total size: 18403  kB

Counters:
---------
gets = 391254016
backtracks = 5337816
semantic match passed = 390724361
semantic match miss = 30214
null node hit= 375768712
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391289631
backtracks = 121101285
semantic match passed = 37598
semantic match miss = 0
null node hit= 261707
skipped node resize = 0

Mon Jun 29 11:55:11 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276976
        Prefixes:       290445
        Internal nodes: 66812
          1: 34702  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691604
Null ptrs: 347817
Total size: 18402  kB

Counters:
---------
gets = 391317389
backtracks = 5339175
semantic match passed = 390787523
semantic match miss = 30215
null node hit= 375827612
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391353001
backtracks = 121122087
semantic match passed = 37599
semantic match miss = 0
null node hit= 261709
skipped node resize = 0

Mon Jun 29 11:55:21 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276981
        Prefixes:       290451
        Internal nodes: 66813
          1: 34704  2: 13943  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691604
Null ptrs: 347811
Total size: 18403  kB

Counters:
---------
gets = 391434307
backtracks = 5340855
semantic match passed = 390904256
semantic match miss = 30225
null node hit= 375931780
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391469942
backtracks = 121157220
semantic match passed = 37619
semantic match miss = 0
null node hit= 261753
skipped node resize = 0

Mon Jun 29 11:55:31 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276981
        Prefixes:       290451
        Internal nodes: 66813
          1: 34704  2: 13943  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691604
Null ptrs: 347811
Total size: 18403  kB

Counters:
---------
gets = 391519852
backtracks = 5342208
semantic match passed = 390989658
semantic match miss = 30234
null node hit= 376010537
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391555492
backtracks = 121181992
semantic match passed = 37625
semantic match miss = 0
null node hit= 261762
skipped node resize = 0

Mon Jun 29 11:55:42 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276978
        Prefixes:       290447
        Internal nodes: 66812
          1: 34703  2: 13943  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691602
Null ptrs: 347813
Total size: 18403  kB

Counters:
---------
gets = 391589032
backtracks = 5343757
semantic match passed = 391058601
semantic match miss = 30237
null node hit= 376075389
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391624673
backtracks = 121202115
semantic match passed = 37628
semantic match miss = 0
null node hit= 261763
skipped node resize = 0

Mon Jun 29 11:55:52 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276985
        Prefixes:       290455
        Internal nodes: 66814
          1: 34704  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691608
Null ptrs: 347810
Total size: 18403  kB

Counters:
---------
gets = 391723925
backtracks = 5345934
semantic match passed = 391193292
semantic match miss = 30242
null node hit= 376195655
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391759578
backtracks = 121241080
semantic match passed = 37640
semantic match miss = 0
null node hit= 261804
skipped node resize = 0

Mon Jun 29 11:56:02 2009
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.28
        Max depth:      7
        Leaves:         276985
        Prefixes:       290455
        Internal nodes: 66814
          1: 34704  2: 13944  3: 9921  4: 4807  5: 2273  6: 1158  7: 5  9: 1  18: 1
        Pointers: 691608
Null ptrs: 347810
Total size: 18403  kB

Counters:
---------
gets = 391811219
backtracks = 5347635
semantic match passed = 391280357
semantic match miss = 30250
null node hit= 376276182
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 391846880
backtracks = 121265316
semantic match passed = 37648
semantic match miss = 0
null node hit= 261813
skipped node resize = 0
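
The counters in the dumps above can be turned into rough rates, which makes the difference between the Main and Local tables easier to see. What follows is only a minimal userspace sketch in C. It assumes the counters have already been parsed out of /proc/net/fib_triestat, and the struct and function names are hypothetical, not kernel code.

```c
#include <stdint.h>

/* Hypothetical holder for two of the per-table counters shown above. */
struct trie_counters {
	uint64_t gets;
	uint64_t backtracks;
};

/* Backtracks per 1000 lookups, using integer arithmetic (rounds down). */
static uint64_t backtracks_per_mille(const struct trie_counters *c)
{
	if (c->gets == 0)
		return 0;
	return c->backtracks * 1000 / c->gets;
}

/*
 * Lookups per second between two samples taken `secs` seconds apart.
 * Returns 0 if the interval is empty or the counter went backwards.
 */
static uint64_t gets_per_sec(const struct trie_counters *prev,
			     const struct trie_counters *cur,
			     unsigned int secs)
{
	if (secs == 0 || cur->gets < prev->gets)
		return 0;
	return (cur->gets - prev->gets) / secs;
}
```

Fed with the 11:54:31 sample, this gives about 13 backtracks per 1000 gets for Main and about 309 per 1000 for Local. Between the first two dumps the Main gets counter advances by roughly 8000 per second.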


>      	       	       	    	 
>
> Jarek Poplawski writes:
>  > On Sat, Jun 27, 2009 at 09:20:57PM +0200, Jarek Poplawski wrote:
>  > ...
>  > > Then these settable thresholds might be more useful here than memory
>  > > fixes, but here is some idea to try handle this automatically within
>  > > some limits. The patch below increases inflate_threshold_root (only)
>  > > up to ~50% of its initial value if needed, and should be able to go
>  > > back sometimes.
>  > > 
>  > > Pawel and Jorge, could you try this? (It applies to 2.6.29 too, with
>  > > some offsets.)
>  > 
>  > A tiny adjustment in the last if...
>  > 
>  > Jarek P.
>  > --- (take 2)
>  > 
>  >  net/ipv4/fib_trie.c |   23 ++++++++++++++++-------
>  >  1 files changed, 16 insertions(+), 7 deletions(-)
>  > 
>  > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>  > index 012cf5a..1dc1bb4 100644
>  > --- a/net/ipv4/fib_trie.c
>  > +++ b/net/ipv4/fib_trie.c
>  > @@ -318,6 +318,7 @@ static const int halve_threshold = 25;
>  >  static const int inflate_threshold = 50;
>  >  static const int halve_threshold_root = 8;
>  >  static const int inflate_threshold_root = 15;
>  > +static int inflate_threshold_root_fix;
>  >  
>  >  
>  >  static void __alias_free_mem(struct rcu_head *head)
>  > @@ -602,7 +603,8 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  >  	/* Keep root node larger  */
>  >  
>  >  	if (!tn->parent)
>  > -		inflate_threshold_use = inflate_threshold_root;
>  > +		inflate_threshold_use = inflate_threshold_root +
>  > +					inflate_threshold_root_fix;
>  >  	else
>  >  		inflate_threshold_use = inflate_threshold;
>  >  
>  > @@ -626,15 +628,22 @@ static struct node *resize(struct trie *t, struct tnode *tn)
>  >  	}
>  >  
>  >  	if (max_resize < 0) {
>  > -		if (!tn->parent)
>  > -			pr_warning("Fix inflate_threshold_root."
>  > -				   " Now=%d size=%d bits\n",
>  > -				   inflate_threshold_root, tn->bits);
>  > -		else
>  > +		if (!tn->parent) {
>  > +			if (inflate_threshold_root_fix * 2 <
>  > +			    inflate_threshold_root)
>  > +				inflate_threshold_root_fix++;
>  > +			else
>  > +				pr_warning("Fix inflate_threshold_root."
>  > +					   " Now=%d size=%d bits fix=%d\n",
>  > +					   inflate_threshold_root, tn->bits,
>  > +					   inflate_threshold_root_fix);
>  > +		} else {
>  >  			pr_warning("Fix inflate_threshold."
>  >  				   " Now=%d size=%d bits\n",
>  >  				   inflate_threshold, tn->bits);
>  > -	}
>  > +		}
>  > +	} else if (max_resize > 4 && !tn->parent && inflate_threshold_root_fix)
>  > +		inflate_threshold_root_fix--;
>  >  
>  >  	check_tnode(tn);
>  >  
>  > --
>  > To unsubscribe from this list: send the line "unsubscribe netdev" in
>  > the body of a message to majordomo@vger.kernel.org
>  > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>   
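
The adjustment logic of the take-2 patch quoted above can be tried out in isolation. This is a userspace sketch only: the two variables carry the names and initial values from the patch, while root_inflate_threshold() and root_resize_done() are hypothetical helpers that stand in for the relevant lines of resize() when tn is the root node.

```c
/* Names and values as in the quoted patch. */
static const int inflate_threshold_root = 15;
static int inflate_threshold_root_fix;	/* starts at 0 */

/* Threshold that resize() would use for the root node. */
static int root_inflate_threshold(void)
{
	return inflate_threshold_root + inflate_threshold_root_fix;
}

/*
 * Hypothetical helper mirroring the bookkeeping at the end of resize()
 * for the root node. max_resize is the iteration budget left over.
 * Returns 1 where the patch would print the "Fix inflate_threshold_root"
 * warning, 0 otherwise.
 */
static int root_resize_done(int max_resize)
{
	if (max_resize < 0) {
		/* Budget exhausted: raise the threshold while below ~50%. */
		if (inflate_threshold_root_fix * 2 < inflate_threshold_root) {
			inflate_threshold_root_fix++;
			return 0;
		}
		return 1;
	}
	/* Plenty of budget left: let the correction decay again. */
	if (max_resize > 4 && inflate_threshold_root_fix)
		inflate_threshold_root_fix--;
	return 0;
}
```

Starting from zero, eight consecutive calls with an exhausted budget raise the effective root threshold from 15 to 23, and the ninth call reports that the warning would be printed. A later call with more than 4 iterations left lowers the threshold by one.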



* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 21:36                 ` Jarek Poplawski
@ 2009-06-29  8:08                   ` Paweł Staszewski
  2009-06-29  8:47                     ` Paweł Staszewski
  2009-06-29  8:33                   ` [PATCH net-2.6] " Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29  8:08 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Jarek Poplawski wrote:
> To David Miller:
> since among patches tested negatively by Pawel are current 2 fixes
> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
> please withdraw them until they are tested alone. Thanks.
>
> To Pawel:
> On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>   
>> After applying this patch something is wrong:
>>
>> traffic is not forwarded,
>> there is no info in dmesg and no info from bgp,
>> and I also can't connect to the bgpd process.
>>
>> I reverted the kernel to the previous version with Jarek's first patch.
>>
>>     
>
> Since checking this can take time I attach here a patch with only
> changes which are currently in 2.6.31-rc. Of course, this part can be
> broken as well, so it's up to you: if you could try it with caution
> somewhere it would be very helpful; otherwise don't bother.
>
> It could be applied to 2.6.29 with or without this currently working
> patch.
>
>   

Ok.
I applied this patch 15 minutes ago to 2.6.29.5 and it is working now:
traffic is forwarded.

Some fib_triestat output:
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.29
        Max depth:      6
        Leaves:         277015
        Prefixes:       290493
        Internal nodes: 67115
          1: 35733  2: 13635  3: 9544  4: 4832  5: 2239  6: 1125  7: 5  9: 1  18: 1
        Pointers: 686614
Null ptrs: 342485
Total size: 18396  kB

Counters:
---------
gets = 3956301
backtracks = 192497
semantic match passed = 3895955
semantic match miss = 133
null node hit= 4306948
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 3960981
backtracks = 2152441
semantic match passed = 4757
semantic match miss = 0
null node hit= 194997
skipped node resize = 0



> Thanks,
> Jarek P.
> --- (for 2.6.29.x, .28 or .27)
>
> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> --- a/net/ipv4/fib_trie.c	2009-06-27 20:25:06.000000000 +0200
> +++ b/net/ipv4/fib_trie.c	2009-06-28 23:06:02.000000000 +0200
> @@ -123,6 +123,7 @@ struct tnode {
>  	union {
>  		struct rcu_head rcu;
>  		struct work_struct work;
> +		struct tnode *tnode_free;
>  	};
>  	struct node *child[0];
>  };
> @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
>  static struct node *resize(struct trie *t, struct tnode *tn);
>  static struct tnode *inflate(struct trie *t, struct tnode *tn);
>  static struct tnode *halve(struct trie *t, struct tnode *tn);
> +/* tnodes to free after resize(); protected by RTNL */
> +static struct tnode *tnode_free_head;
>  
>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> @@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
>  		call_rcu(&tn->rcu, __tnode_free_rcu);
>  }
>  
> +static void tnode_free_safe(struct tnode *tn)
> +{
> +	BUG_ON(IS_LEAF(tn));
> +	tn->tnode_free = tnode_free_head;
> +	tnode_free_head = tn;
> +}
> +
> +static void tnode_free_flush(void)
> +{
> +	struct tnode *tn;
> +
> +	while ((tn = tnode_free_head)) {
> +		tnode_free_head = tn->tnode_free;
> +		tn->tnode_free = NULL;
> +		tnode_free(tn);
> +	}
> +}
> +
>  static struct leaf *leaf_new(void)
>  {
>  	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
> @@ -495,7 +516,7 @@ static struct node *resize(struct trie *
>  
>  	/* No children */
>  	if (tn->empty_children == tnode_child_length(tn)) {
> -		tnode_free(tn);
> +		tnode_free_safe(tn);
>  		return NULL;
>  	}
>  	/* One child */
> @@ -509,7 +530,7 @@ static struct node *resize(struct trie *
>  
>  			/* compress one level */
>  			node_set_parent(n, NULL);
> -			tnode_free(tn);
> +			tnode_free_safe(tn);
>  			return n;
>  		}
>  	/*
> @@ -670,7 +691,7 @@ static struct node *resize(struct trie *
>  			/* compress one level */
>  
>  			node_set_parent(n, NULL);
> -			tnode_free(tn);
> +			tnode_free_safe(tn);
>  			return n;
>  		}
>  
> @@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
>  			put_child(t, tn, 2*i, inode->child[0]);
>  			put_child(t, tn, 2*i+1, inode->child[1]);
>  
> -			tnode_free(inode);
> +			tnode_free_safe(inode);
>  			continue;
>  		}
>  
> @@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
>  		put_child(t, tn, 2*i, resize(t, left));
>  		put_child(t, tn, 2*i+1, resize(t, right));
>  
> -		tnode_free(inode);
> +		tnode_free_safe(inode);
>  	}
> -	tnode_free(oldtnode);
> +	tnode_free_safe(oldtnode);
>  	return tn;
>  nomem:
>  	{
> @@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
>  		put_child(t, newBinNode, 1, right);
>  		put_child(t, tn, i/2, resize(t, newBinNode));
>  	}
> -	tnode_free(oldtnode);
> +	tnode_free_safe(oldtnode);
>  	return tn;
>  nomem:
>  	{
> @@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
>  	return NULL;
>  }
>  
> -static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
> +static void trie_rebalance(struct trie *t, struct tnode *tn)
>  {
>  	int wasfull;
> -	t_key cindex, key = tn->key;
> +	t_key cindex, key;
>  	struct tnode *tp;
>  
> +	key = tn->key;
> +
>  	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
>  		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
>  		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
> @@ -998,6 +1021,7 @@ static struct node *trie_rebalance(struc
>  				      (struct node *)tn, wasfull);
>  
>  		tp = node_parent((struct node *) tn);
> +		tnode_free_flush();
>  		if (!tp)
>  			break;
>  		tn = tp;
> @@ -1007,7 +1031,10 @@ static struct node *trie_rebalance(struc
>  	if (IS_TNODE(tn))
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
>  
> -	return (struct node *)tn;
> +	rcu_assign_pointer(t->trie, (struct node *)tn);
> +	tnode_free_flush();
> +
> +	return;
>  }
>  
>  /* only used from updater-side */
> @@ -1155,7 +1182,7 @@ static struct list_head *fib_insert_node
>  
>  	/* Rebalance the trie */
>  
> -	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
> +	trie_rebalance(t, tp);
>  done:
>  	return fa_head;
>  }
> @@ -1575,7 +1602,7 @@ static void trie_leaf_remove(struct trie
>  	if (tp) {
>  		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
>  		put_child(t, (struct tnode *)tp, cindex, NULL);
> -		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
> +		trie_rebalance(t, tp);
>  	} else
>  		rcu_assign_pointer(t->trie, NULL);
>  
>
>
>
>   
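
The core of the patch quoted above is the pair tnode_free_safe() / tnode_free_flush(): nodes are only queued while the trie is being reshaped and are released in one go afterwards. The following userspace sketch shows that pattern on its own. It keeps the function and field names from the patch, but the struct is cut down, free() stands in for the kernel's RCU-deferred tnode_free(), and tnode_new(), freed_count and the NULL check are additions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>

/* Cut-down stand-in for the kernel's struct tnode. */
struct tnode {
	int id;
	struct tnode *tnode_free;	/* link in the pending-free list */
};

/* Nodes waiting to be freed; the kernel protects this with RTNL. */
static struct tnode *tnode_free_head;
static int freed_count;			/* example-only bookkeeping */

/* Hypothetical allocator for the example. */
static struct tnode *tnode_new(int id)
{
	struct tnode *tn = calloc(1, sizeof(*tn));

	if (tn)
		tn->id = id;
	return tn;
}

/* Queue a node instead of freeing it immediately. */
static void tnode_free_safe(struct tnode *tn)
{
	if (!tn)			/* guard added for this sketch */
		return;
	tn->tnode_free = tnode_free_head;
	tnode_free_head = tn;
}

/* Release everything queued so far. */
static void tnode_free_flush(void)
{
	struct tnode *tn;

	while ((tn = tnode_free_head)) {
		tnode_free_head = tn->tnode_free;
		tn->tnode_free = NULL;
		free(tn);		/* kernel: tnode_free(tn) via RCU */
		freed_count++;
	}
}
```

Nothing is released until tnode_free_flush() runs, so a caller can finish relinking parents and children first and flush once the structure is consistent again.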



* [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-28 21:36                 ` Jarek Poplawski
  2009-06-29  8:08                   ` Paweł Staszewski
@ 2009-06-29  8:33                   ` Jarek Poplawski
  2009-06-29  9:51                     ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29  8:33 UTC (permalink / raw)
  To: David Miller
  Cc: Paweł Staszewski,
	Robert Olsson, Robert Olsson, Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On 28-06-2009 23:36, Jarek Poplawski wrote:
> To David Miller:
> since among patches tested negatively by Pawel are current 2 fixes
> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
> please withdraw them until they are tested alone. Thanks.

David, IMHO this fix is needed in net-2.6 even if it doesn't fix the
problem reported by Pawel (there could be still something more).

Pawel, I see you decided to test my previous patch, but try to add
this one on top.

Thanks,
Jarek P.
------------------->
ipv4: Fix fib_trie rebalancing, part 3

Alas current delaying of freeing old tnodes by RCU in trie_rebalance
is still not enough because we can free a top tnode before updating a
t->trie pointer.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/ipv4/fib_trie.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 012cf5a..00a54b2 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
+		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
 		tnode_free_flush();
 		if (!tp)
 			break;
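
The ordering problem described in the changelog above can be shown with a toy model. This is a deliberately simplified userspace sketch: it ignores RCU grace periods entirely, uses a one-slot pending list, and all names in it (struct node, free_safe, flush, replace_root) are hypothetical. It only demonstrates why the root pointer has to be published before the flush.

```c
#include <stddef.h>

struct node {
	int freed;		/* set once the node has been released */
};

struct trie {
	struct node *root;	/* stands in for t->trie */
};

/* One-slot pending list, enough for this illustration. */
static struct node *pending;

static void free_safe(struct node *n)
{
	pending = n;
}

/* Returns 1 if the flush released a node that t->root still points to. */
static int flush(struct trie *t)
{
	int stale = 0;

	if (pending) {
		pending->freed = 1;
		stale = (t->root == pending);
		pending = NULL;
	}
	return stale;
}

/*
 * Replace the root. With publish_first == 0 the flush happens while
 * t->root still names the old node; with publish_first != 0 the new
 * root is visible before anything is released.
 */
static int replace_root(struct trie *t, struct node *new_root,
			int publish_first)
{
	int stale;

	free_safe(t->root);	/* the resize step queued the old root */
	if (publish_first)
		t->root = new_root;
	stale = flush(t);
	t->root = new_root;
	return stale;
}
```

Calling replace_root() with publish_first set to 0 reports a stale root, and with publish_first set to 1 it does not. That matches the intent of assigning t->trie before tnode_free_flush() in the hunk above.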


* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  8:08                   ` Paweł Staszewski
@ 2009-06-29  8:47                     ` Paweł Staszewski
  2009-06-29  9:27                       ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29  8:47 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

But with all these patches I still have the same problem with CPU load.
Every time route cache entries are purged, CPU load increases from 1% to
somewhere between 40% and 80%.

On a 64-bit machine, while the route cache entry count is going down, I
see almost 80% load on each CPU that an Ethernet card is bound to via
smp_affinity. On a 32-bit machine the CPU load reported by mpstat is half
of that on the 64-bit machine.
Here is an example from the 32-bit machine (mpstat + rtstat -k entries):

Linux 2.6.29.5 (TM_02_C1)       06/29/09        _i686_  (2 CPU)

12:36:54     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle        RT CACHE ENTRIES (from rtstat)
12:36:57     all    0.00    0.00    0.00    0.00    1.51   15.08    0.00    0.00   83.42        83346
12:36:58     all    0.00    0.00    0.00    0.00    1.01    7.58    0.00    0.00   91.41        85988
12:36:59     all    0.00    0.00    0.00    0.00    0.50    1.01    0.00    0.00   98.49        89979
12:37:00     all    0.00    0.00    0.50    0.00    0.00    1.51    0.00    0.00   97.99        93652
12:37:01     all    0.00    0.00    0.00    0.00    0.00    2.01    0.00    0.00   97.99        96533
12:37:02     all    0.00    0.00    0.00    0.00    0.51    1.01    0.00    0.00   98.48        99451
12:37:03     all    0.00    0.00    0.00    0.00    0.00    2.49    0.00    0.00   97.51        102018
12:37:04     all    0.00    0.00    0.00    0.00    0.00    1.52    0.00    0.00   98.48        104153
12:37:05     all    0.00    0.00    0.00    0.00    0.00    1.01    0.00    0.00   98.99        105979
12:37:06     all    0.00    0.00    0.00    0.00    0.00    1.01    0.00    0.00   98.99        107684
12:37:07     all    0.00    0.00    0.00    0.00    0.00    1.53    0.00    0.00   98.47        109070
12:37:08     all    0.00    0.00    0.00    0.00    0.00    1.51    0.00    0.00   98.49        110462
12:37:09     all    0.00    0.00    0.00    0.00    0.00    1.52    0.00    0.00   98.48        112301
12:37:10     all    0.00    0.00    0.00    0.00    2.00   20.00    0.00    0.00   78.00        111535
12:37:11     all    0.00    0.00    0.00    0.00    2.49   34.33    0.00    0.00   63.18        108659
12:37:12     all    0.00    0.00    0.00    0.00    3.03   28.28    0.00    0.00   68.69        105534
12:37:13     all    0.00    0.00    0.00    0.00    3.98   30.85    0.00    0.00   65.17        103341
12:37:14     all    0.00    0.00    0.00    0.00    4.50   30.50    0.00    0.00   65.00        101307
12:37:15     all    5.56    0.00    0.00    0.00    1.52   28.79    0.00    0.00   64.14        97435
12:37:16     all   11.39    0.00    0.50    0.00    4.95   30.69    0.00    0.00   52.48        93908
12:37:17     all    1.51    0.00    0.00    0.00    1.01   27.64    0.00    0.00   69.85        90229
12:37:18     all    0.00    0.00    0.00    0.00    2.99   27.36    0.00    0.00   69.65        87030
12:37:19     all    0.00    0.00    0.00    0.00    3.02   29.65    0.00    0.00   67.34        84324
12:37:20     all    0.00    0.00    0.00    0.00    2.99   30.35    0.00    0.00   66.67        82167
12:37:21     all    0.00    0.00    0.00    0.00    1.98   31.68    0.00    0.00   66.34        80121
12:37:22     all    0.00    0.00    0.00    0.00    1.51   30.65    0.00    0.00   67.84        77850
12:37:23     all    0.00    0.00    0.00    0.00    2.50   28.50    0.00    0.00   69.00        76005
12:37:24     all    0.00    0.00    0.00    0.00    1.98   23.27    0.00    0.00   74.75        74538
12:37:25     all    0.00    0.00    0.49    0.00    2.93   22.44    0.00    0.00   74.15        76923
12:37:26     all    0.00    0.00    0.00    0.00    1.51   15.58    0.00    0.00   82.91        79396
12:37:27     all    0.00    0.00    0.00    0.00    0.50    7.96    0.00    0.00   91.54        81835
12:37:28     all    0.00    0.00    0.00    0.00    0.50    3.52    0.00    0.00   95.98        84169
12:37:29     all    0.00    0.00    0.00    0.00    0.00    2.02    0.00    0.00   97.98        87740
12:37:30     all    0.00    0.00    0.00    0.00    0.51    1.52    0.00    0.00   97.98        91152
12:37:31     all    0.00    0.00    0.00    0.00    0.00    1.99    0.00    0.00   98.01        94102
12:37:32     all    0.00    0.00    0.00    0.00    0.00    1.52    0.00    0.00   98.48        97032
12:37:33     all    0.00    0.00    0.00    0.00    0.00    0.50    0.00    0.00   99.50        99685
12:37:34     all    0.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00   99.00        101970
12:37:35     all    0.00    0.00    0.00    0.00    0.50    1.00    0.00    0.00   98.50        103814
12:37:36     all    0.00    0.00    0.00    0.00    0.00    1.52    0.00    0.00   98.48        104793
12:37:37     all    0.00    0.00    0.00    0.00    0.00    1.01    0.00    0.00   98.99        106214
12:37:38     all    0.00    0.00    0.00    0.00    0.50    1.01    0.00    0.00   98.49        107300
12:37:39     all    0.00    0.00    0.00    0.00    0.00   13.00    0.00    0.00   87.00        111951
12:37:40     all    0.00    0.00    0.00    0.00    2.50   29.50    0.00    0.00   68.00        111215
12:37:41     all    0.00    0.00    0.00    0.00    2.01   30.65    0.00    0.00   67.34        108023
12:37:42     all    0.00    0.00    0.00    0.00    2.99   29.85    0.00    0.00   67.16        104751
12:37:43     all    0.00    0.00    0.00    0.00    2.00   31.00    0.00    0.00   67.00        100827
12:37:44     all    0.00    0.00    0.00    0.00    3.00   27.00    0.00    0.00   70.00        97184
12:37:45     all    0.00    0.00    0.00    0.00    2.50   29.00    0.00    0.00   68.50        93904
12:37:46     all    0.00    0.00    0.00    0.00    3.02   30.15    0.00    0.00   66.83        90979
12:37:47     all    0.00    0.00    0.00    0.00    2.49   27.86    0.00    0.00   69.65        88315
12:37:48     all    0.00    0.00    0.00    0.00    2.48   31.19    0.00    0.00   66.34        87777
12:37:49     all    0.00    0.00    0.00    0.00    2.94   32.35    0.00    0.00   64.71        89218
12:37:50     all    0.00    0.00    0.00    0.00    3.00   32.50    0.00    0.00   64.50        85896
12:37:51     all    0.00    0.00    0.00    0.00    2.50   30.00    0.00    0.00   67.50        82712
12:37:52     all    0.50    0.00    0.00    0.00    2.49   30.85    0.00    0.00   66.17        79137
12:37:53     all    0.00    0.00    0.50    0.00    2.00   28.50    0.00    0.00   69.00        75644
12:37:54     all    0.00    0.00    0.00    0.00    2.51   30.65    0.00    0.00   66.83        72843
12:37:55     all    0.00    0.00    0.50    0.00    3.48   28.36    0.00    0.00   67.66        73460


Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> To David Miller:
>> since among patches tested negatively by Pawel are current 2 fixes
>> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
>> please withdraw them until they are tested alone. Thanks.
>>
>> To Pawel:
>> On Sun, Jun 28, 2009 at 05:48:19PM +0200, Paweł Staszewski wrote:
>>  
>>> After applying this patch something is wrong:
>>>
>>> traffic is not forwarded,
>>> there is no info in dmesg and no info from bgp,
>>> and I also can't connect to the bgpd process.
>>>
>>> I reverted the kernel to the previous version with Jarek's first patch.
>>>
>>>     
>>
>> Since checking this can take time I attach here a patch with only
>> changes which are currently in 2.6.31-rc. Of course, this part can be
>> broken as well, so it's up to you: if you could try it with caution
>> somewhere it would be very helpful; otherwise don't bother.
>>
>> It could be applied to 2.6.29 with or without this currently working
>> patch.
>>
>>   
>
> Ok.
> I applied this patch 15 minutes ago to 2.6.29.5 and it is working now:
> traffic is forwarded.
>
> Some fib_triestat output:
> cat /proc/net/fib_triestat
> Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
> Main:
>        Aver depth:     2.29
>        Max depth:      6
>        Leaves:         277015
>        Prefixes:       290493
>        Internal nodes: 67115
>          1: 35733  2: 13635  3: 9544  4: 4832  5: 2239  6: 1125  7: 5  9: 1  18: 1
>        Pointers: 686614
> Null ptrs: 342485
> Total size: 18396  kB
>
> Counters:
> ---------
> gets = 3956301
> backtracks = 192497
> semantic match passed = 3895955
> semantic match miss = 133
> null node hit= 4306948
> skipped node resize = 0
>
> Local:
>        Aver depth:     3.75
>        Max depth:      5
>        Leaves:         12
>        Prefixes:       13
>        Internal nodes: 10
>          1: 9  2: 1
>        Pointers: 22
> Null ptrs: 1
> Total size: 2  kB
>
> Counters:
> ---------
> gets = 3960981
> backtracks = 2152441
> semantic match passed = 4757
> semantic match miss = 0
> null node hit= 194997
> skipped node resize = 0
>
>
>
>> Thanks,
>> Jarek P.
>> --- (for 2.6.29.x, .28 or .27)
>>
>> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> --- a/net/ipv4/fib_trie.c    2009-06-27 20:25:06.000000000 +0200
>> +++ b/net/ipv4/fib_trie.c    2009-06-28 23:06:02.000000000 +0200
>> @@ -123,6 +123,7 @@ struct tnode {
>>      union {
>>          struct rcu_head rcu;
>>          struct work_struct work;
>> +        struct tnode *tnode_free;
>>      };
>>      struct node *child[0];
>>  };
>> @@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
>>  static struct node *resize(struct trie *t, struct tnode *tn);
>>  static struct tnode *inflate(struct trie *t, struct tnode *tn);
>>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>> +/* tnodes to free after resize(); protected by RTNL */
>> +static struct tnode *tnode_free_head;
>>  
>>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
>> @@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
>>          call_rcu(&tn->rcu, __tnode_free_rcu);
>>  }
>>  
>> +static void tnode_free_safe(struct tnode *tn)
>> +{
>> +    BUG_ON(IS_LEAF(tn));
>> +    tn->tnode_free = tnode_free_head;
>> +    tnode_free_head = tn;
>> +}
>> +
>> +static void tnode_free_flush(void)
>> +{
>> +    struct tnode *tn;
>> +
>> +    while ((tn = tnode_free_head)) {
>> +        tnode_free_head = tn->tnode_free;
>> +        tn->tnode_free = NULL;
>> +        tnode_free(tn);
>> +    }
>> +}
>> +
>>  static struct leaf *leaf_new(void)
>>  {
>>      struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
>> @@ -495,7 +516,7 @@ static struct node *resize(struct trie *
>>  
>>      /* No children */
>>      if (tn->empty_children == tnode_child_length(tn)) {
>> -        tnode_free(tn);
>> +        tnode_free_safe(tn);
>>          return NULL;
>>      }
>>      /* One child */
>> @@ -509,7 +530,7 @@ static struct node *resize(struct trie *
>>  
>>              /* compress one level */
>>              node_set_parent(n, NULL);
>> -            tnode_free(tn);
>> +            tnode_free_safe(tn);
>>              return n;
>>          }
>>      /*
>> @@ -670,7 +691,7 @@ static struct node *resize(struct trie *
>>              /* compress one level */
>>  
>>              node_set_parent(n, NULL);
>> -            tnode_free(tn);
>> +            tnode_free_safe(tn);
>>              return n;
>>          }
>>  
>> @@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
>>              put_child(t, tn, 2*i, inode->child[0]);
>>              put_child(t, tn, 2*i+1, inode->child[1]);
>>  
>> -            tnode_free(inode);
>> +            tnode_free_safe(inode);
>>              continue;
>>          }
>>  
>> @@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
>>          put_child(t, tn, 2*i, resize(t, left));
>>          put_child(t, tn, 2*i+1, resize(t, right));
>>  
>> -        tnode_free(inode);
>> +        tnode_free_safe(inode);
>>      }
>> -    tnode_free(oldtnode);
>> +    tnode_free_safe(oldtnode);
>>      return tn;
>>  nomem:
>>      {
>> @@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
>>          put_child(t, newBinNode, 1, right);
>>          put_child(t, tn, i/2, resize(t, newBinNode));
>>      }
>> -    tnode_free(oldtnode);
>> +    tnode_free_safe(oldtnode);
>>      return tn;
>>  nomem:
>>      {
>> @@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
>>      return NULL;
>>  }
>>  
>> -static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
>> +static void trie_rebalance(struct trie *t, struct tnode *tn)
>>  {
>>      int wasfull;
>> -    t_key cindex, key = tn->key;
>> +    t_key cindex, key;
>>      struct tnode *tp;
>>  
>> +    key = tn->key;
>> +
>>      while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
>>          cindex = tkey_extract_bits(key, tp->pos, tp->bits);
>>          wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
>> @@ -998,6 +1021,7 @@ static struct node *trie_rebalance(struc
>>                        (struct node *)tn, wasfull);
>>  
>>          tp = node_parent((struct node *) tn);
>> +        tnode_free_flush();
>>          if (!tp)
>>              break;
>>          tn = tp;
>> @@ -1007,7 +1031,10 @@ static struct node *trie_rebalance(struc
>>      if (IS_TNODE(tn))
>>          tn = (struct tnode *)resize(t, (struct tnode *)tn);
>>  
>> -    return (struct node *)tn;
>> +    rcu_assign_pointer(t->trie, (struct node *)tn);
>> +    tnode_free_flush();
>> +
>> +    return;
>>  }
>>  
>>  /* only used from updater-side */
>> @@ -1155,7 +1182,7 @@ static struct list_head *fib_insert_node
>>  
>>      /* Rebalance the trie */
>>  
>> -    rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
>> +    trie_rebalance(t, tp);
>>  done:
>>      return fa_head;
>>  }
>> @@ -1575,7 +1602,7 @@ static void trie_leaf_remove(struct trie
>>      if (tp) {
>>          t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
>>          put_child(t, (struct tnode *)tp, cindex, NULL);
>> -        rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
>> +        trie_rebalance(t, tp);
>>      } else
>>          rcu_assign_pointer(t->trie, NULL);
>>  
>>
>>
>>
>>   
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  8:47                     ` Paweł Staszewski
@ 2009-06-29  9:27                       ` Jarek Poplawski
  2009-06-29  9:43                         ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29  9:27 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 10:47:44AM +0200, Paweł Staszewski wrote:
> But
> With all this patches i have the same problem with CPU load
> Every time when route cache entries are purged cpu load is increasing  
> from 1% to 40 / 80% it depends
>
> I see that on 64bit machine when route cache entries are going down i  
> have almost 80% load on each cpu where ethernet card is binded by  
> smp_affinity
> But on 32bit machine cpu load reported by mpstat is half that on 64bit  
> machine
> here is example from 32bit machine ( mpstat + rtstat -k entries )
>
> Linux 2.6.29.5 (TM_02_C1)       06/29/09        _i686_  (2 CPU)
>
> 12:36:54     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle        RT CACHE ENTRIES (from rtstat)
> 12:36:57     all    0.00    0.00    0.00    0.00    1.51   15.08    0.00    0.00   83.42        83346

I guess Eric is thinking about this. Btw., two little suggestions:
these route cache reports would be easier to track if they stayed in
their starting thread ("weird problem"?), and could you send these
stats/logs as attachments or turn off line wrapping, please? ;-)

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  9:27                       ` Jarek Poplawski
@ 2009-06-29  9:43                         ` Paweł Staszewski
  0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29  9:43 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 10:47:44AM +0200, Paweł Staszewski wrote:
>   
>> But
>> With all this patches i have the same problem with CPU load
>> Every time when route cache entries are purged cpu load is increasing  
>> from 1% to 40 / 80% it depends
>>
>> I see that on 64bit machine when route cache entries are going down i  
>> have almost 80% load on each cpu where ethernet card is binded by  
>> smp_affinity
>> But on 32bit machine cpu load reported by mpstat is half that on 64bit  
>> machine
>> here is example from 32bit machine ( mpstat + rtstat -k entries )
>>
>> Linux 2.6.29.5 (TM_02_C1)       06/29/09        _i686_  (2 CPU)
>>
>> 12:36:54     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle        RT CACHE ENTRIES (from rtstat)
>> 12:36:57     all    0.00    0.00    0.00    0.00    1.51   15.08    0.00    0.00   83.42        83346
>>     
>
> I guess Eric is thinking about this. Btw., two little suggestions:
> it should be easier to track if these route cache reports stay in its
> starting thread ("weird problem"?), and if you could send these
> stats/logs as attachements or turn off line wrapping, please? ;-)
>
> Thanks,
> Jarek P.
Sorry, Jarek, for combining the problems :)
And yes, I will send the next stats as attachments :)


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  8:33                   ` [PATCH net-2.6] " Jarek Poplawski
@ 2009-06-29  9:51                     ` Paweł Staszewski
  2009-06-29 10:47                       ` Jarek Poplawski
  2009-06-29 10:58                       ` [PATCH net-2.6] " Jarek Poplawski
  0 siblings, 2 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29  9:51 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

[-- Attachment #1: Type: text/plain, Size: 1699 bytes --]

I applied this patch.

The fib_triestat output is in the attached file :)


Jarek Poplawski wrote:
> On 28-06-2009 23:36, Jarek Poplawski wrote:
>   
>> To David Miller:
>> since among patches tested negatively by Pawel are current 2 fixes
>> from 2.6.31-rc, I hope they weren't sent to -stable yet. Otherwise,
>> please withdraw them until they are tested alone. Thanks.
>>     
>
> David, IMHO this fix is needed in net-2.6 even if it doesn't fix the
> problem reported by Pawel (there could be still something more).
>
> Pawel, I see you decided to test my previous patch, but try to add
> this one on top.
>
> Thanks,
> Jarek P.
> ------------------->
> ipv4: Fix fib_trie rebalancing, part 3
>
> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
> is still not enough because we can free a top tnode before updating a
> t->trie pointer.
>
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> ---
>
>  net/ipv4/fib_trie.c |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 012cf5a..00a54b2 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -1021,6 +1021,9 @@ static void trie_rebalance(struct trie *t, struct tnode *tn)
>  				      (struct node *)tn, wasfull);
>  
>  		tp = node_parent((struct node *) tn);
> +		if (!tp)
> +			rcu_assign_pointer(t->trie, (struct node *)tn);
> +
>  		tnode_free_flush();
>  		if (!tp)
>  			break;


[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 3032 bytes --]

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.29
        Max depth:      7
        Leaves:         276909
        Prefixes:       290383
        Internal nodes: 66893
          1: 34715  2: 14024  3: 9889  4: 4833  5: 2275  6: 1150  7: 5  9: 1  18: 1
        Pointers: 691662
Null ptrs: 347861
Total size: 18403  kB

Counters:
---------
gets = 2297579
backtracks = 131491
semantic match passed = 2233070
semantic match miss = 42
null node hit= 2016883
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 2302102
backtracks = 1545197
semantic match passed = 4536
semantic match miss = 0
null node hit= 192664
skipped node resize = 0

----------------------------------------------------------------

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.29
        Max depth:      7
        Leaves:         276904
        Prefixes:       290378
        Internal nodes: 66889
          1: 34711  2: 14023  3: 9890  4: 4833  5: 2275  6: 1150  7: 5  9: 1  18: 1
        Pointers: 691658
Null ptrs: 347866
Total size: 18402  kB

Counters:
---------
gets = 3006945
backtracks = 138787
semantic match passed = 2942047
semantic match miss = 85
null node hit= 2826377
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 3011504
backtracks = 1796587
semantic match passed = 4577
semantic match miss = 0
null node hit= 192747
skipped node resize = 0

--------------------------------------------------------------

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.29
        Max depth:      7
        Leaves:         276904
        Prefixes:       290378
        Internal nodes: 66891
          1: 34710  2: 14025  3: 9892  4: 4832  5: 2275  6: 1150  7: 5  9: 1  18: 1
        Pointers: 691664
Null ptrs: 347870
Total size: 18402  kB

Counters:
---------
gets = 3320633
backtracks = 141904
semantic match passed = 3255585
semantic match miss = 99
null node hit= 3177543
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 3325226
backtracks = 1904022
semantic match passed = 4601
semantic match miss = 0
null node hit= 192782
skipped node resize = 0


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  9:51                     ` Paweł Staszewski
@ 2009-06-29 10:47                       ` Jarek Poplawski
  2009-06-29 16:24                         ` Paweł Staszewski
  2009-06-30  7:09                         ` Jarek Poplawski
  2009-06-29 10:58                       ` [PATCH net-2.6] " Jarek Poplawski
  1 sibling, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 10:47 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> I apply this patch
>
> fib_triestats in attached file :)

Great! But it would be nice to check if this (accidentally ;-) might
fix the previous problem, so I attach below the patch with "manual
RCU", which btw. (or even more important) should verify RCU use here.

It should be applied on top of this last "Fix..., part3". And
again: it's quite probable it can fail, so with caution, no hurry
(it can wait for quiet time)...

Many thanks,
Jarek P.
--------------------> (synchronize_rcu take 4)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-06-29 10:04:22.000000000 +0000
@@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st
 	vfree(tn);
 }
 
+static void __tnode_free(struct tnode *tn)
+{
+	size_t size = sizeof(struct tnode) +
+		      (sizeof(struct node *) << tn->bits);
+
+	if (size <= PAGE_SIZE)
+		kfree(tn);
+	else
+		vfree(tn);
+}
+
 static void __tnode_free_rcu(struct rcu_head *head)
 {
 	struct tnode *tn = container_of(head, struct tnode, rcu);
@@ -402,7 +413,7 @@ static void tnode_free_flush(void)
 	while ((tn = tnode_free_head)) {
 		tnode_free_head = tn->tnode_free;
 		tn->tnode_free = NULL;
-		tnode_free(tn);
+		__tnode_free(tn);
 	}
 }
 
@@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie *
 				      (struct node *)tn, wasfull);
 
 		tp = node_parent((struct node *) tn);
-		if (!tp)
+		if (!tp) {
 			rcu_assign_pointer(t->trie, (struct node *)tn);
-
-		tnode_free_flush();
-		if (!tp)
 			break;
+		}
 		tn = tp;
 	}
 
+	if (tnode_free_head) {
+		synchronize_rcu();
+		tnode_free_flush();
+	}
+
 	/* Handle last (top) tnode */
-	if (IS_TNODE(tn))
+	if (IS_TNODE(tn)) {
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
-
-	rcu_assign_pointer(t->trie, (struct node *)tn);
-	tnode_free_flush();
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+		synchronize_rcu();
+		tnode_free_flush();
+	} else {
+		rcu_assign_pointer(t->trie, (struct node *)tn);
+	}
 
 	return;
 }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29  9:51                     ` Paweł Staszewski
  2009-06-29 10:47                       ` Jarek Poplawski
@ 2009-06-29 10:58                       ` Jarek Poplawski
  2009-06-30 19:48                         ` David Miller
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 10:58 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> I apply this patch
>
> fib_triestats in attached file :)
>
>> ------------------->
>> ipv4: Fix fib_trie rebalancing, part 3
>>
>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
>> is still not enough because we can free a top tnode before updating a
>> t->trie pointer.
>>
>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>> ---

David, I guess you could add:

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29 10:47                       ` Jarek Poplawski
@ 2009-06-29 16:24                         ` Paweł Staszewski
  2009-06-29 17:09                           ` Jarek Poplawski
  2009-06-30  7:09                         ` Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-29 16:24 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

[-- Attachment #1: Type: text/plain, Size: 2975 bytes --]

Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>   
>> I apply this patch
>>
>> fib_triestats in attached file :)
>>     
>
> Great! But it would be nice to check if this (accidentally ;-) might
> fix the previous problem, so I attach below the patch with "manual
> RCU", which btw. (or even more important) should verify RCU use here.
>
>   
With these patches all is OK: I no longer see "Fix inflate_threshold_root",
even when I run "clear ip bgp *".

Before these patches, running "clear ip bgp" always produced
"Fix inflate_threshold_root" messages in dmesg.


> It should be applied on top of this last "Fix..., part3". And
> again: it's quite probable it can fail, so with caution, no hurry
> (it can wait for quiet time)...
>
>   
After applying this last patch, traffic is not forwarded again :)
I had to be fast, so the attached file has only a few fib_triestat
samples, taken before the failover switched routers.

These stats are from the machine running this last patch, which makes the
kernel stop forwarding.


> Many thanks,
> Jarek P.
> --------------------> (synchronize_rcu take 4)
>
> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> --- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
> +++ b/net/ipv4/fib_trie.c	2009-06-29 10:04:22.000000000 +0000
> @@ -366,6 +366,17 @@ static void __tnode_vfree(struct work_st
>  	vfree(tn);
>  }
>  
> +static void __tnode_free(struct tnode *tn)
> +{
> +	size_t size = sizeof(struct tnode) +
> +		      (sizeof(struct node *) << tn->bits);
> +
> +	if (size <= PAGE_SIZE)
> +		kfree(tn);
> +	else
> +		vfree(tn);
> +}
> +
>  static void __tnode_free_rcu(struct rcu_head *head)
>  {
>  	struct tnode *tn = container_of(head, struct tnode, rcu);
> @@ -402,7 +413,7 @@ static void tnode_free_flush(void)
>  	while ((tn = tnode_free_head)) {
>  		tnode_free_head = tn->tnode_free;
>  		tn->tnode_free = NULL;
> -		tnode_free(tn);
> +		__tnode_free(tn);
>  	}
>  }
>  
> @@ -1021,21 +1032,27 @@ static void trie_rebalance(struct trie *
>  				      (struct node *)tn, wasfull);
>  
>  		tp = node_parent((struct node *) tn);
> -		if (!tp)
> +		if (!tp) {
>  			rcu_assign_pointer(t->trie, (struct node *)tn);
> -
> -		tnode_free_flush();
> -		if (!tp)
>  			break;
> +		}
>  		tn = tp;
>  	}
>  
> +	if (tnode_free_head) {
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	}
> +
>  	/* Handle last (top) tnode */
> -	if (IS_TNODE(tn))
> +	if (IS_TNODE(tn)) {
>  		tn = (struct tnode *)resize(t, (struct tnode *)tn);
> -
> -	rcu_assign_pointer(t->trie, (struct node *)tn);
> -	tnode_free_flush();
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +		synchronize_rcu();
> +		tnode_free_flush();
> +	} else {
> +		rcu_assign_pointer(t->trie, (struct node *)tn);
> +	}
>  
>  	return;
>  }


[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 890 bytes --]

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     3.51
        Max depth:      8
        Leaves:         3089
        Prefixes:       3156
        Internal nodes: 1167
          1: 737  2: 202  3: 150  4: 40  5: 28  6: 8  7: 1  10: 1
        Pointers: 6682
Null ptrs: 2427
Total size: 214  kB

Counters:
---------
gets = 1554240
backtracks = 916511
semantic match passed = 1127691
semantic match miss = 27
null node hit= 1439140
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 1554767
backtracks = 572694
semantic match passed = 534
semantic match miss = 0
null node hit= 288
skipped node resize = 0


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29 16:24                         ` Paweł Staszewski
@ 2009-06-29 17:09                           ` Jarek Poplawski
  0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-29 17:09 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 06:24:47PM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>>   
>>> I apply this patch
>>>
>>> fib_triestats in attached file :)
>>>     
>>
>> Great! But it would be nice to check if this (accidentally ;-) might
>> fix the previous problem, so I attach below the patch with "manual
>> RCU", which btw. (or even more important) should verify RCU use here.
>>
>>   
> After this patches all is OK now i don't see Fix inflate_threshold_root.
> Even if i make "clear ip bgp * "
>
> Before this patches when i make clear ip bgp there was always info in  
> dmesg about "Fix inflate_threshold_root"
>
>
>> It should be applied on top of this last "Fix..., part3". And
>> again: it's quite probable it can fail, so with caution, no hurry
>> (it can wait for quiet time)...
>>
>>   
> After apply this last patch - traffic is not forwarded again :)
> i was fast and have only some fib_triestats in attached file before  
> failover switch routers.
>
> This stats are from machine with this last patch that makes kernel to  
> stop forwarding

OK, I'll look at it again.

Thanks for testing!
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29 10:47                       ` Jarek Poplawski
  2009-06-29 16:24                         ` Paweł Staszewski
@ 2009-06-30  7:09                         ` Jarek Poplawski
  2009-06-30 20:16                           ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-30  7:09 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> > I apply this patch
> >
> > fib_triestats in attached file :)
> 
> Great! But it would be nice to check if this (accidentally ;-) might
> fix the previous problem, so I attach below the patch with "manual
> RCU", which btw. (or even more important) should verify RCU use here.
> 
> It should be applied on top of this last "Fix..., part3". And
> again: it's quite probable it can fail, so with caution, no hurry
> (it can wait for quiet time)...

Pawel, here is another try to check what's going on here, so just
like before, but this one on top of these 2 last working patches,
plus quiet time... (Stats aren't necessary; if there are any doubts
let me know.)

Thanks,
Jarek P.
--------------------> (synchronize_rcu take 5)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
@@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
 
 	rcu_assign_pointer(t->trie, (struct node *)tn);
 	tnode_free_flush();
+	synchronize_rcu();
 
 	return;
 }

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-29 10:58                       ` [PATCH net-2.6] " Jarek Poplawski
@ 2009-06-30 19:48                         ` David Miller
  2009-06-30 20:14                           ` Jarek Poplawski
  2009-07-10 15:29                           ` Stephen Hemminger
  0 siblings, 2 replies; 99+ messages in thread
From: David Miller @ 2009-06-30 19:48 UTC (permalink / raw)
  To: jarkao2
  Cc: pstaszewski, robert, Robert.Olsson, jorge, dada1, robert.olsson, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 29 Jun 2009 10:58:20 +0000

> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>> I apply this patch
>>
>> fib_triestats in attached file :)
>>
>>> ------------------->
>>> ipv4: Fix fib_trie rebalancing, part 3
>>>
>>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
>>> is still not enough because we can free a top tnode before updating a
>>> t->trie pointer.
>>>
>>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
>>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
>>> ---
> 
> David, I guess you could add:
> 
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

Done, and applied, thanks Jarek.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30 19:48                         ` David Miller
@ 2009-06-30 20:14                           ` Jarek Poplawski
  2009-07-10 15:29                           ` Stephen Hemminger
  1 sibling, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-30 20:14 UTC (permalink / raw)
  To: David Miller
  Cc: pstaszewski, robert, Robert.Olsson, jorge, dada1, robert.olsson, netdev

On Tue, Jun 30, 2009 at 12:48:49PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 29 Jun 2009 10:58:20 +0000
> 
> > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> >> I apply this patch
> >>
> >> fib_triestats in attached file :)
> >>
> >>> ------------------->
> >>> ipv4: Fix fib_trie rebalancing, part 3
> >>>
> >>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
> >>> is still not enough because we can free a top tnode before updating a
> >>> t->trie pointer.
> >>>
> >>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> >>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> >>> ---
> > 
> > David, I guess you could add:
> > 
> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> 
> Done, and applied, thanks Jarek.

Btw., a little comment: there are still some issues while trying to
reclaim memory after synchronize_rcu, which means the algorithm is
buggy, or RCU use is still buggy, or maybe there is some timing problem
caused by synchronize_rcu. Anyway, fib_trie still seems to be safe only
with CONFIG_PREEMPT_NONE, so I have no idea how this should be fixed in
-stable (or why people don't report this BUG more often in 2.6.30)...

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30  7:09                         ` Jarek Poplawski
@ 2009-06-30 20:16                           ` Paweł Staszewski
  2009-06-30 20:41                             ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-30 20:16 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

Jarek Poplawski wrote:
> On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote:
>   
>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>>     
>>> I apply this patch
>>>
>>> fib_triestats in attached file :)
>>>       
>> Great! But it would be nice to check if this (accidentally ;-) might
>> fix the previous problem, so I attach below the patch with "manual
>> RCU", which btw. (or even more important) should verify RCU use here.
>>
>> It should be applied on top of this last "Fix..., part3". And
>> again: it's quite probable it can fail, so with caution, no hurry
>> (it can wait for quiet time)...
>>     
>
> Pawel, here is another try to check what's going on here, so just
> like before, but this one on top of these 2 last working patches,
> plus quite time... (Stats aren't necessary; if these are some doubts
> let me know.)
>
> Thanks,
> Jarek P.
> --------------------> (synchronize_rcu take 5)
>
> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> --- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
> +++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
> @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
>  
>  	rcu_assign_pointer(t->trie, (struct node *)tn);
>  	tnode_free_flush();
> +	synchronize_rcu();
>  
>  	return;
>  }
>   

Applied and tested.

Traffic is not forwarded after applying this patch :)



^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30 20:16                           ` Paweł Staszewski
@ 2009-06-30 20:41                             ` Jarek Poplawski
  2009-06-30 23:31                               ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-06-30 20:41 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: David Miller, Robert Olsson, Robert Olsson,
	Jorge Boncompte [DTI2],
	Eric Dumazet, Robert Olsson, Linux Network Development list

On Tue, Jun 30, 2009 at 10:16:57PM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote:
>>   
>>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>>>     
>>>> I apply this patch
>>>>
>>>> fib_triestats in attached file :)
>>>>       
>>> Great! But it would be nice to check if this (accidentally ;-) might
>>> fix the previous problem, so I attach below the patch with "manual
>>> RCU", which btw. (or even more important) should verify RCU use here.
>>>
>>> It should be applied on top of this last "Fix..., part3". And
>>> again: it's quite probable it can fail, so with caution, no hurry
>>> (it can wait for quiet time)...
>>>     
>>
>> Pawel, here is another try to check what's going on here, so just
>> like before, but this one on top of these 2 last working patches,
>> plus quite time... (Stats aren't necessary; if these are some doubts
>> let me know.)
>>
>> Thanks,
>> Jarek P.
>> --------------------> (synchronize_rcu take 5)
>>
>> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> --- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
>> +++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
>> @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
>>   	rcu_assign_pointer(t->trie, (struct node *)tn);
>>  	tnode_free_flush();
>> +	synchronize_rcu();
>>   	return;
>>  }
>>   
>
> Apply and tested
>
> Traffic is not forwarded after apply this patch.:)

A little comment: these last 2 patches weren't exactly meant to fix the
problem you reported, which should be mostly fixed by the earlier
patch.

There is some other bug, which you avoid with CONFIG_PREEMPT_NONE
(but it's not certain there are no side effects). So, I'd like to be
sure you are willing and able (without too much risk) to do more such
tests. Alas, I have no way to generate similar conditions, so otherwise
it would simply have to wait for somebody else.

Many thanks again,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30 20:41                             ` Jarek Poplawski
@ 2009-06-30 23:31                               ` Paweł Staszewski
  2009-07-01  6:36                                 ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-06-30 23:31 UTC (permalink / raw)
  To: Jarek Poplawski, Linux Network Development list

Jarek Poplawski pisze:
> On Tue, Jun 30, 2009 at 10:16:57PM +0200, Paweł Staszewski wrote:
>   
>> Jarek Poplawski pisze:
>>     
>>> On Mon, Jun 29, 2009 at 10:47:03AM +0000, Jarek Poplawski wrote:
>>>   
>>>       
>>>> On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
>>>>     
>>>>         
>>>>> I apply this patch
>>>>>
>>>>> fib_triestats in attached file :)
>>>>>       
>>>>>           
>>>> Great! But it would be nice to check if this (accidentally ;-) might
>>>> fix the previous problem, so I attach below the patch with "manual
>>>> RCU", which btw. (or even more important) should verify RCU use here.
>>>>
>>>> It should be applied on top of this last "Fix..., part3". And
>>>> again: it's quite probable it can fail, so with caution, no hurry
>>>> (it can wait for quiet time)...
>>>>     
>>>>         
>>> Pawel, here is another try to check what's going on here, so just
>>> like before, but this one on top of these 2 last working patches,
>>> plus quite time... (Stats aren't necessary; if these are some doubts
>>> let me know.)
>>>
>>> Thanks,
>>> Jarek P.
>>> --------------------> (synchronize_rcu take 5)
>>>
>>> diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> --- a/net/ipv4/fib_trie.c	2009-06-29 10:00:14.000000000 +0000
>>> +++ b/net/ipv4/fib_trie.c	2009-06-30 06:50:35.000000000 +0000
>>> @@ -1036,6 +1036,7 @@ static void trie_rebalance(struct trie *
>>>   	rcu_assign_pointer(t->trie, (struct node *)tn);
>>>  	tnode_free_flush();
>>> +	synchronize_rcu();
>>>   	return;
>>>  }
>>>   
>>>       
>> Apply and tested
>>
>> Traffic is not forwarded after apply this patch.:)
>>     
>
> A little comment: these last 2 patches weren't exactly to fix the
> problem you reported, which should be mostly fixed by the earlier
> patch.
>
> There is some other bug, which you omit with CONFIG_PREEMPT_NONE
> (but it's not for sure there is no by effects). So, I'd like to be
> sure you are willing and can (without too much risk) to do more such
> tests. Alas I've no way to generate similar conditions so it would
> simply have to wait for somebody else.
>
>   
Yes, I can run tests like this.
My network is split into test clients and other, normal clients,
so testing is really no problem. If the test clients are working,
traffic from the normal clients is also switched to this router; if
traffic is not forwarded for the test clients (as in this case),
failover switches them to a working router.

The other reason to run these tests is that it is good to have
everything in Linux kernel networking working well :)

Regards
Paweł Staszewski

> Many thanks again,
> Jarek P.
>
>
>   


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30 23:31                               ` Paweł Staszewski
@ 2009-07-01  6:36                                 ` Jarek Poplawski
       [not found]                                   ` <20090701072409.GA12592@ff.dom.local>
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01  6:36 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list

On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
...
> Yes i can make tests like this.
> My network is splited to test clients and other normal clients
> so it's really no problem to make testing. - if testing clients working  
> then traffic from normal clients is also switched to this router (but if  
> traffic is not forwarded "like in this case" for testing clients then  
> failover switching them to working router )
>
> and other point to make this tests - is that - it is good to have all in  
> linux kernel networking working well :)

It's extremely nice of you! On the other hand, this type of change
was planned for net-next to fix possible memory problems, which
might have happened to you as well. So you'd probably experience this
problem in the future (2.6.32) anyway.

So here is the first of 2 patches (the second in a separate message),
which should be tested separately, each one applied on top of
2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
one. So, they are again all-in-one, to exclude any misunderstanding.

Btw., I assume there were no oopses, warnings or lockups after those
previous non-working patches - only no routing/forwarding.

Thanks,
Jarek P.
----------> (synchronize take 6 all-in-one for 2.6.29x, .28, or .27)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-06-29 05:30:50.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 05:15:37.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn, bool sync)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -999,6 +1022,10 @@ static struct node *trie_rebalance(struc
 
 		tp = node_parent((struct node *) tn);
 		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
+		//tnode_free_flush();
+		if (!tp)
 			break;
 		tn = tp;
 	}
@@ -1007,7 +1034,12 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	if (sync)
+		synchronize_rcu();
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1187,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp, true);
 done:
 	return fa_head;
 }
@@ -1575,7 +1607,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp, false);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 

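For readers following the patch above: its core is a deferred-free list.
resize(), inflate() and halve() queue old tnodes with tnode_free_safe(),
and trie_rebalance() releases them with tnode_free_flush() only after the
new root has been published. Below is a minimal userspace sketch of that
pattern, assuming a single updater thread (the kernel code relies on RTNL
instead). The names demo_node, demo_free_safe(), demo_free_flush() and
demo_rebalance() are hypothetical, and plain free() stands in for the
call_rcu()-based tnode_free().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct tnode: only the link used while queued. */
struct demo_node {
	int id;
	struct demo_node *free_next;	/* plays the role of tn->tnode_free */
};

/* Pending list; assumed to be touched by one updater thread only. */
static struct demo_node *free_head;
static int freed_count;

/* Queue a node instead of releasing it at once (cf. tnode_free_safe()). */
static void demo_free_safe(struct demo_node *n)
{
	assert(n != NULL);
	n->free_next = free_head;
	free_head = n;
}

/* Release everything queued so far (cf. tnode_free_flush()). */
static void demo_free_flush(void)
{
	struct demo_node *n;

	while ((n = free_head) != NULL) {
		free_head = n->free_next;
		n->free_next = NULL;
		free(n);
		freed_count++;
	}
}

/*
 * Queue 'count' nodes as a resize pass would, then flush once at the end.
 * Returns how many nodes this call released.
 */
static int demo_rebalance(int count)
{
	int before = freed_count;
	int i;

	for (i = 0; i < count; i++) {
		struct demo_node *n = malloc(sizeof(*n));

		if (!n)
			break;
		n->id = i;
		demo_free_safe(n);
	}
	/* Nothing has been released yet: the nodes are only queued. */
	assert(freed_count == before);
	demo_free_flush();
	return freed_count - before;
}
```

The point of the split is ordering: nodes are only queued while the trie
is being reshaped, and the single flush happens after the reshaping is
done, which in the patch is after rcu_assign_pointer() and the optional
synchronize_rcu().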
^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
       [not found]                                   ` <20090701072409.GA12592@ff.dom.local>
@ 2009-07-01  9:43                                     ` Paweł Staszewski
  2009-07-01  9:50                                       ` Paweł Staszewski
  2009-07-01 10:13                                       ` Jarek Poplawski
  0 siblings, 2 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-01  9:43 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

Jarek Poplawski pisze:
> On Wed, Jul 01, 2009 at 06:36:51AM +0000, Jarek Poplawski wrote:
>   
>> On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
>> ...
>>     
>
> It looks like Cc was shortened BTW, but I guess at least Robert is
> interested in this testing, so I add him back.
>
> Cheers,
> Jarek P.
>
>   
>>> Yes i can make tests like this.
>>> My network is splited to test clients and other normal clients
>>> so it's really no problem to make testing. - if testing clients working  
>>> then traffic from normal clients is also switched to this router (but if  
>>> traffic is not forwarded "like in this case" for testing clients then  
>>> failover switching them to working router )
>>>
>>> and other point to make this tests - is that - it is good to have all in  
>>> linux kernel networking working well :)
>>>       
>> It's extremely nice of you! On the other hand, this type of change
>> was planned to the net-next to fix possible memory problems, which
>> might have happened to you as well. So you'd probably experience this
>> problem in the future (2.6.32) anyway.
>>
>> So here is the first of 2 patches (the second in a separate message),
>> which should be tested separately, each one applied on top of the
>> 2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
>> one. So, they are again all-in-one, to eclude any misunderstanding.
>>
>> Btw., I assume there were no oopses, warnings or lockups after those
>> previous non-working patches - only no routing/forwarding.
>>
>>     
Yes, with the previous patches there were no warnings, oopses or lockups.

But now I applied this patch and did more testing.
On the first boot, with bgpd started, traffic was not forwarded.
So I started to investigate and set up only a few routes (static,
without bgpd) through this host.
Everything works for this host when all routes are static.

Then I changed my BGP configuration a little, set a default route to only
one of my iBGP peers, and started the bgpd process.
Everything works, but what is weird is the number of routes in the
kernel table. The kernel is learning routes from bgpd, but very slowly -
really very slowly.

The attached file has some fib_triestat output after 5 min of traffic.

Without this patch (normally), the total size reported by fib_triestat
reaches "Total size: 35769  kB" in less than 1 sec.

But with this patch, the total size keeps growing, and in 5 min of
traffic it grew to only "Total size: 1005  kB".

Regards
Paweł Staszewski
 


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-01  9:43                                     ` Paweł Staszewski
@ 2009-07-01  9:50                                       ` Paweł Staszewski
  2009-07-01 10:13                                       ` Jarek Poplawski
  1 sibling, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-01  9:50 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

[-- Attachment #1: Type: text/plain, Size: 8578 bytes --]

Paweł Staszewski pisze:
> Jarek Poplawski pisze:
>> On Wed, Jul 01, 2009 at 06:36:51AM +0000, Jarek Poplawski wrote:
>>  
>>> On Wed, Jul 01, 2009 at 01:31:09AM +0200, Paweł Staszewski wrote:
>>> ...
>>>     
>>
>> It looks like Cc was shortened BTW, but I guess at least Robert is
>> interested in this testing, so I add him back.
>>
>> Cheers,
>> Jarek P.
>>
>>  
>>>> Yes i can make tests like this.
>>>> My network is splited to test clients and other normal clients
>>>> so it's really no problem to make testing. - if testing clients 
>>>> working  then traffic from normal clients is also switched to this 
>>>> router (but if  traffic is not forwarded "like in this case" for 
>>>> testing clients then  failover switching them to working router )
>>>>
>>>> and other point to make this tests - is that - it is good to have 
>>>> all in  linux kernel networking working well :)
>>>>       
>>> It's extremely nice of you! On the other hand, this type of change
>>> was planned to the net-next to fix possible memory problems, which
>>> might have happened to you as well. So you'd probably experience this
>>> problem in the future (2.6.32) anyway.
>>>
>>> So here is the first of 2 patches (the second in a separate message),
>>> which should be tested separately, each one applied on top of the
>>> 2.6.29.x (vanilla - at least fib_trie.c), after reverting the previous
>>> one. So, they are again all-in-one, to eclude any misunderstanding.
>>>
>>> Btw., I assume there were no oopses, warnings or lockups after those
>>> previous non-working patches - only no routing/forwarding.
>>>
>>>     
> Yes on on previous patches there was / no warnings / no oopses or lockups
>
> But now i apply this patch and i make more testing.
> First boot with start of bgpd and - traffic is not forwarded
> So i start to search and make only some routes (static without bgpd) 
> thru this host
> And all is working for this host when i make all by static routes.
>
> So i change a little my bgp configuration and make default route to 
> only one of my iBGP peers and start bgpd process
> All is working and what is weird is number of routes in kernel table.
> Kernel is learning routes from bgpd but very slowly - really very slowly.
>
> In attached file there are some fib_triestats after 5min of traffic.
>
> Without this patch (normally)
> total size: reported by fib_triestats in less that 1sec is:  "Total 
> size: 35769  kB"
>
> But with this patch
> Total size is growing up and in 5 min of traffic it grow to only:  
> "Total size: 1005  kB"
>
Sorry, the file was not attached; here it is.



[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 1883 bytes --]

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     3.79
        Max depth:      9
        Leaves:         15518
        Prefixes:       15933
        Internal nodes: 3973
          1: 2260  2: 674  3: 518  4: 268  5: 164  6: 79  7: 5  8: 2  9: 1  10: 2
        Pointers: 29664
Null ptrs: 10174
Total size: 995  kB

Counters:
---------
gets = 17863461
backtracks = 13345457
semantic match passed = 17305229
semantic match miss = 419
null node hit= 17602641
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 17865423
backtracks = 4964174
semantic match passed = 2126
semantic match miss = 0
null node hit= 1853
skipped node resize = 0


----------- After 30sec ----------------
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     3.79
        Max depth:      9
        Leaves:         15686
        Prefixes:       16111
        Internal nodes: 4002
          1: 2259  2: 679  3: 536  4: 274  5: 165  6: 79  7: 5  8: 2  9: 1  10: 2
        Pointers: 29954
Null ptrs: 10267
Total size: 1005  kB

Counters:
---------
gets = 18042821
backtracks = 13523292
semantic match passed = 17484572
semantic match miss = 419
null node hit= 17799334
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 18044798
backtracks = 5012942
semantic match passed = 2140
semantic match miss = 0
null node hit= 1865
skipped node resize = 0


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-01  9:43                                     ` Paweł Staszewski
  2009-07-01  9:50                                       ` Paweł Staszewski
@ 2009-07-01 10:13                                       ` Jarek Poplawski
  2009-07-01 11:04                                         ` Jarek Poplawski
  1 sibling, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01 10:13 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson

On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:
...
> Yes on on previous patches there was / no warnings / no oopses or lockups
>
> But now i apply this patch and i make more testing.
> First boot with start of bgpd and - traffic is not forwarded
> So i start to search and make only some routes (static without bgpd)  
> thru this host
> And all is working for this host when i make all by static routes.
>
> So i change a little my bgp configuration and make default route to only  
> one of my iBGP peers and start bgpd process
> All is working and what is weird is number of routes in kernel table.
> Kernel is learning routes from bgpd but very slowly - really very slowly.

Pawel, this is really very helpful! So, this is (probably) only about
timing, not wrong memory freeing. On the other hand this test was only
for inserts. Btw., if you didn't start the second test, you can skip
it. I have to rethink this.
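
A rough back-of-envelope on the numbers in this thread, as a sketch only:
the two fib_triestat snapshots posted with the previous message show the
Main table going from 15518 to 15686 leaves in 30 seconds, and the first
report in this thread listed 276539 leaves for the full table. Assuming a
constant insertion rate (real bgpd behaviour may differ), the helpers
below, whose names are hypothetical, estimate how long a full table would
take to load.

```c
#include <assert.h>

/* Leaves added per second between two snapshots taken 'seconds' apart. */
static double leaves_per_second(long before, long after, long seconds)
{
	assert(seconds > 0);
	return (double)(after - before) / (double)seconds;
}

/* Seconds needed to grow from 'current' to 'target' leaves at 'rate'. */
static double seconds_to_reach(long current, long target, double rate)
{
	assert(rate > 0.0);
	return (double)(target - current) / rate;
}
```

With those inputs, leaves_per_second(15518, 15686, 30) gives 5.6 leaves
per second, and seconds_to_reach(15686, 276539, 5.6) gives roughly 46,600
seconds, about 13 hours. That would be consistent with the "only about
timing" reading if each insert has to wait for an RCU grace period.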

Many thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-01 10:13                                       ` Jarek Poplawski
@ 2009-07-01 11:04                                         ` Jarek Poplawski
  2009-07-01 22:17                                           ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-01 11:04 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson

On Wed, Jul 01, 2009 at 10:13:33AM +0000, Jarek Poplawski wrote:
> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:
> ...
> > Yes on on previous patches there was / no warnings / no oopses or lockups
> >
> > But now i apply this patch and i make more testing.
> > First boot with start of bgpd and - traffic is not forwarded
> > So i start to search and make only some routes (static without bgpd)  
> > thru this host
> > And all is working for this host when i make all by static routes.
> >
> > So i change a little my bgp configuration and make default route to only  
> > one of my iBGP peers and start bgpd process
> > All is working and what is weird is number of routes in kernel table.
> > Kernel is learning routes from bgpd but very slowly - really very slowly.
> 
> Pawel, this is really very helpful! So, this is (probably) only about
> timing, not wrong memory freeing. On the other hand this test was only
> for inserts. Btw., if you didn't start the second test, you can skip
> it. I have to rethink this.

So, after your findings I'm about to recommend sending to -stable
3 patches from net-2.6, with additional lowering of threshold_root
settings, but it would be nice if you could give it a try with
CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
your other apps!) It is expected to work this time...;-) Maybe a
bit slower.

Thanks,
Jarek P.
--------> (all-in-one preempt fixes to apply with vanilla 2.6.29.x)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-01 06:17:08.000000000 +0000
+++ b/net/ipv4/fib_trie.c	2009-07-01 10:43:44.000000000 +0000
@@ -123,6 +123,7 @@ struct tnode {
 	union {
 		struct rcu_head rcu;
 		struct work_struct work;
+		struct tnode *tnode_free;
 	};
 	struct node *child[0];
 };
@@ -161,6 +162,8 @@ static void tnode_put_child_reorg(struct
 static struct node *resize(struct trie *t, struct tnode *tn);
 static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
+/* tnodes to free after resize(); protected by RTNL */
+static struct tnode *tnode_free_head;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -313,8 +316,8 @@ static inline void check_tnode(const str
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
 
 
 static void __alias_free_mem(struct rcu_head *head)
@@ -385,6 +388,24 @@ static inline void tnode_free(struct tno
 		call_rcu(&tn->rcu, __tnode_free_rcu);
 }
 
+static void tnode_free_safe(struct tnode *tn)
+{
+	BUG_ON(IS_LEAF(tn));
+	tn->tnode_free = tnode_free_head;
+	tnode_free_head = tn;
+}
+
+static void tnode_free_flush(void)
+{
+	struct tnode *tn;
+
+	while ((tn = tnode_free_head)) {
+		tnode_free_head = tn->tnode_free;
+		tn->tnode_free = NULL;
+		tnode_free(tn);
+	}
+}
+
 static struct leaf *leaf_new(void)
 {
 	struct leaf *l = kmem_cache_alloc(trie_leaf_kmem, GFP_KERNEL);
@@ -495,7 +516,7 @@ static struct node *resize(struct trie *
 
 	/* No children */
 	if (tn->empty_children == tnode_child_length(tn)) {
-		tnode_free(tn);
+		tnode_free_safe(tn);
 		return NULL;
 	}
 	/* One child */
@@ -509,7 +530,7 @@ static struct node *resize(struct trie *
 
 			/* compress one level */
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 	/*
@@ -670,7 +691,7 @@ static struct node *resize(struct trie *
 			/* compress one level */
 
 			node_set_parent(n, NULL);
-			tnode_free(tn);
+			tnode_free_safe(tn);
 			return n;
 		}
 
@@ -756,7 +777,7 @@ static struct tnode *inflate(struct trie
 			put_child(t, tn, 2*i, inode->child[0]);
 			put_child(t, tn, 2*i+1, inode->child[1]);
 
-			tnode_free(inode);
+			tnode_free_safe(inode);
 			continue;
 		}
 
@@ -801,9 +822,9 @@ static struct tnode *inflate(struct trie
 		put_child(t, tn, 2*i, resize(t, left));
 		put_child(t, tn, 2*i+1, resize(t, right));
 
-		tnode_free(inode);
+		tnode_free_safe(inode);
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -885,7 +906,7 @@ static struct tnode *halve(struct trie *
 		put_child(t, newBinNode, 1, right);
 		put_child(t, tn, i/2, resize(t, newBinNode));
 	}
-	tnode_free(oldtnode);
+	tnode_free_safe(oldtnode);
 	return tn;
 nomem:
 	{
@@ -983,12 +1004,14 @@ fib_find_node(struct trie *t, u32 key)
 	return NULL;
 }
 
-static struct node *trie_rebalance(struct trie *t, struct tnode *tn)
+static void trie_rebalance(struct trie *t, struct tnode *tn)
 {
 	int wasfull;
-	t_key cindex, key = tn->key;
+	t_key cindex, key;
 	struct tnode *tp;
 
+	key = tn->key;
+
 	while (tn != NULL && (tp = node_parent((struct node *)tn)) != NULL) {
 		cindex = tkey_extract_bits(key, tp->pos, tp->bits);
 		wasfull = tnode_full(tp, tnode_get_child(tp, cindex));
@@ -999,6 +1022,10 @@ static struct node *trie_rebalance(struc
 
 		tp = node_parent((struct node *) tn);
 		if (!tp)
+			rcu_assign_pointer(t->trie, (struct node *)tn);
+
+		tnode_free_flush();
+		if (!tp)
 			break;
 		tn = tp;
 	}
@@ -1007,7 +1034,10 @@ static struct node *trie_rebalance(struc
 	if (IS_TNODE(tn))
 		tn = (struct tnode *)resize(t, (struct tnode *)tn);
 
-	return (struct node *)tn;
+	rcu_assign_pointer(t->trie, (struct node *)tn);
+	tnode_free_flush();
+
+	return;
 }
 
 /* only used from updater-side */
@@ -1155,7 +1185,7 @@ static struct list_head *fib_insert_node
 
 	/* Rebalance the trie */
 
-	rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+	trie_rebalance(t, tp);
 done:
 	return fa_head;
 }
@@ -1575,7 +1605,7 @@ static void trie_leaf_remove(struct trie
 	if (tp) {
 		t_key cindex = tkey_extract_bits(l->key, tp->pos, tp->bits);
 		put_child(t, (struct tnode *)tp, cindex, NULL);
-		rcu_assign_pointer(t->trie, trie_rebalance(t, tp));
+		trie_rebalance(t, tp);
 	} else
 		rcu_assign_pointer(t->trie, NULL);
 

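The deferred-free idea in the patch above can be tried in userspace. This is only a minimal sketch, not the kernel code: `struct demo_tnode`, `demo_tnode_new()` and the `demo_freed_count` counter are hypothetical stand-ins, `free()` replaces the kernel's `call_rcu()` path, and single-threaded use stands in for RTNL protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct tnode. */
struct demo_tnode {
	int id;
	struct demo_tnode *tnode_free;	/* link used only while queued for freeing */
};

/* Nodes queued for freeing; the patch relies on RTNL, here we are single-threaded. */
static struct demo_tnode *demo_free_head;
static int demo_freed_count;

/* Stand-in for tnode_free(); the kernel version defers through call_rcu(). */
static void demo_tnode_free(struct demo_tnode *tn)
{
	demo_freed_count++;
	free(tn);
}

/* Queue a node instead of freeing it while the resize is still in progress. */
static void demo_tnode_free_safe(struct demo_tnode *tn)
{
	assert(tn != NULL);
	tn->tnode_free = demo_free_head;
	demo_free_head = tn;
}

/* Free everything queued so far; returns the number of nodes released. */
static int demo_tnode_free_flush(void)
{
	struct demo_tnode *tn;
	int released = 0;

	while ((tn = demo_free_head)) {
		demo_free_head = tn->tnode_free;
		tn->tnode_free = NULL;
		demo_tnode_free(tn);
		released++;
	}
	return released;
}

/* Hypothetical allocator helper for the demo. */
static struct demo_tnode *demo_tnode_new(int id)
{
	struct demo_tnode *tn = malloc(sizeof(*tn));

	if (tn) {
		tn->id = id;
		tn->tnode_free = NULL;
	}
	return tn;
}
```

Queuing two nodes and then flushing once releases both, which mirrors how trie_rebalance() in the patch calls tnode_free_flush() only after the new subtree pointer has been published.
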
^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-01 11:04                                         ` Jarek Poplawski
@ 2009-07-01 22:17                                           ` Paweł Staszewski
  2009-07-02  5:32                                             ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-01 22:17 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

[-- Attachment #1: Type: text/plain, Size: 6762 bytes --]

Jarek Poplawski wrote:
> On Wed, Jul 01, 2009 at 10:13:33AM +0000, Jarek Poplawski wrote:
>   
>> On Wed, Jul 01, 2009 at 11:43:04AM +0200, Paweł Staszewski wrote:
>> ...
>>     
>>> Yes on on previous patches there was / no warnings / no oopses or lockups
>>>
>>> But now i apply this patch and i make more testing.
>>> First boot with start of bgpd and - traffic is not forwarded
>>> So i start to search and make only some routes (static without bgpd)  
>>> thru this host
>>> And all is working for this host when i make all by static routes.
>>>
>>> So i change a little my bgp configuration and make default route to only  
>>> one of my iBGP peers and start bgpd process
>>> All is working and what is weird is number of routes in kernel table.
>>> Kernel is learning routes from bgpd but very slowly - really very slowly.
>>>       
>> Pawel, this is really very helpful! So, this is (probably) only about
>> timing, not wrong memory freeing. On the other hand this test was only
>> for inserts. Btw., if you didn't start the second test, you can skip
>> it. I have to rethink this.
>>     
>
> So, after your findings I'm about to recommend sending to -stable
> 3 patches from net-2.6, with additional lowering of threshold_root
> settings, but it would be nice if you could give it a try with
> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
> your other apps!) It is expected to work this time...;-) Maybe a
> bit slower.
>
>   
Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE,
and it is working :)

The fib_triestat output is in the attached file.

I think I can test it with PREEMPT enabled, but first I must run some
other tests of my apps that are on the server.

Regards
Paweł Staszewski

> Thanks,
> Jarek P.


[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 925 bytes --]

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277395
        Prefixes:       290874
        Internal nodes: 66711
          1: 32915  2: 14668  3: 10752  4: 4913  5: 2197  6: 895  7: 367  8: 3  17: 1
        Pointers: 595526
Null ptrs: 251421
Total size: 18044  kB

Counters:
---------
gets = 2705388
backtracks = 137797
semantic match passed = 2658993
semantic match miss = 87
null node hit= 1980950
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 2709741
backtracks = 1584810
semantic match passed = 4417
semantic match miss = 0
null node hit= 192688
skipped node resize = 0

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-01 22:17                                           ` Paweł Staszewski
@ 2009-07-02  5:32                                             ` Jarek Poplawski
  2009-07-02  5:43                                               ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-02  5:32 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson

On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
...
>> So, after your findings I'm about to recommend sending to -stable
>> 3 patches from net-2.6, with additional lowering of threshold_root
>> settings, but it would be nice if you could give it a try with
>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>> your other apps!) It is expected to work this time...;-) Maybe a
>> bit slower.
>>
>>   
> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
> And working :)

Hmm... It should, because you tested a very similar patch already ;-)
Sorry if I didn't make it clear.

>
> fib_triestats in attached file
>
> I think I can test it with PREEMPT enabled but first i must make some  
> other tests of my apps that are on server.

It could probably matter only if you're using some broken out-of-tree
patches. Otherwise the kernel is expected to work OK.

Btw., it would be also interesting to check if there is any difference
wrt. these route cache problems while PREEMPT is enabled.

Thanks,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02  5:32                                             ` Jarek Poplawski
@ 2009-07-02  5:43                                               ` Paweł Staszewski
  2009-07-02  6:00                                                 ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-02  5:43 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
>   
>> Jarek Poplawski wrote:
>>     
> ...
>   
>>> So, after your findings I'm about to recommend sending to -stable
>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>> settings, but it would be nice if you could give it a try with
>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>> your other apps!) It is expected to work this time...;-) Maybe a
>>> bit slower.
>>>
>>>   
>>>       
>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
>> And working :)
>>     
>
> Hmm... It should, because you tested very similar patch already;-)
> Sorry if I didn't make it clear.
>
>   
Yes, I know there was an almost identical one.
And I see this one was without sync rcu :)

>> fib_triestats in attached file
>>
>> I think I can test it with PREEMPT enabled but first i must make some  
>> other tests of my apps that are on server.
>>     
>
> It could probably matter only if you're using some broken out-of-tree
> patches. Otherwise the kernel is expected to work OK.
>
>   
I'm a little wary of using a PREEMPT kernel because in the past
there were many oopses / lockups :) but yes, that was quite a long time ago.
I will try to run this test today.

> Btw., it would be also interesting to check if there is any difference
> wrt. these route cache problems while PREEMPT is enabled.
>
> Thanks,
> Jarek P.
>
>
>   


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02  5:43                                               ` Paweł Staszewski
@ 2009-07-02  6:00                                                 ` Jarek Poplawski
  2009-07-02 15:31                                                   ` Robert Olsson
  2009-07-05  0:26                                                   ` Paweł Staszewski
  0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-02  6:00 UTC (permalink / raw)
  To: Paweł Staszewski; +Cc: Linux Network Development list, Robert Olsson

On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
>>   
>>> Jarek Poplawski wrote:
>>>     
>> ...
>>   
>>>> So, after your findings I'm about to recommend sending to -stable
>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>> settings, but it would be nice if you could give it a try with
>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>> bit slower.
>>>>
>>>>         
>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
>>> And working :)
>>>     
>>
>> Hmm... It should, because you tested very similar patch already;-)
>> Sorry if I didn't make it clear.
>>
>>   
> Yes i know there was almost identical one.
> And i see this was without sync rcu :)

Yes, it looks like we can't free memory so simply because of such huge
latencies.

>
>>> fib_triestats in attached file
>>>
>>> I think I can test it with PREEMPT enabled but first i must make some 
>>>  other tests of my apps that are on server.
>>>     
>>
>> It could probably matter only if you're using some broken out-of-tree
>> patches. Otherwise the kernel is expected to work OK.
>>
>>   
> Im a little confused about using of PREEMPT kernel because of past
> there was many oopses / lockups :) but yes that was a little long time ago.
> I will try to make this test today.
>
>> Btw., it would be also interesting to check if there is any difference
>> wrt. these route cache problems while PREEMPT is enabled.

And you're very right! The place we're fixing is the best example. On
the other hand, I hope there aren't many such places left. But if we
test/fix it there will be one less...

Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02  6:00                                                 ` Jarek Poplawski
@ 2009-07-02 15:31                                                   ` Robert Olsson
  2009-07-02 19:06                                                     ` Jarek Poplawski
  2009-07-05  0:26                                                   ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-07-02 15:31 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson


Jarek Poplawski writes:

 > Yes, it looks like we can't free memory so simple because of such huge
 > latencies.  

 Controlling RCU seems crucial. Insertion of the full BGP table increased
 from 2 seconds to > 20 min with one of the synchronize_rcu patches.

 And the fib_trie "worst case" wrt memory is the root node. So maybe we should
 monitor changes in the root node and use this to control synchronize_rcu.

 Didn't Paul suggest something like this?

 And if we don't find any decent solution, we will have to add an option for
 a fixed and pre-allocated root node, typically for BGP routers.

 Cheers
					--ro
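
The suggestion above (let large changes, such as a replaced root node, decide when to wait for RCU) could be sketched roughly as follows. This is an illustration only, not code from any patch in this thread: `struct demo_free_budget`, `demo_account_free()`, `demo_tnode_bytes()` and the 128-page limit are all hypothetical. The 36-byte tnode header and the 17-bit node come from the attached fib_triestat output; 4-byte pointers are an assumption based on those 32-bit sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE	4096u
#define DEMO_SYNC_PAGES	128u	/* hypothetical limit on memory waiting to be freed */

/* Tracks how much memory has been queued for freeing since the last sync. */
struct demo_free_budget {
	size_t pending_bytes;
	unsigned long sync_calls;
};

/* Approximate size of a tnode with 2^bits child pointers. */
static size_t demo_tnode_bytes(unsigned int bits, size_t header, size_t ptr_size)
{
	return header + ((size_t)1 << bits) * ptr_size;
}

/*
 * Account for a freed node. Returns true when the pending total reached the
 * limit, i.e. the point where kernel code could call synchronize_rcu().
 */
static bool demo_account_free(struct demo_free_budget *b, size_t bytes)
{
	assert(b != NULL);
	b->pending_bytes += bytes;
	if (b->pending_bytes >= (size_t)DEMO_PAGE_SIZE * DEMO_SYNC_PAGES) {
		b->pending_bytes = 0;
		b->sync_calls++;	/* placeholder for the expensive wait */
		return true;
	}
	return false;
}
```

With these numbers a small 1-bit tnode (44 bytes) never triggers the wait by itself, while replacing a 17-bit root (524324 bytes) does so immediately, so the expensive synchronization is paid only for the rare large nodes.
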

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02 15:31                                                   ` Robert Olsson
@ 2009-07-02 19:06                                                     ` Jarek Poplawski
  2009-07-02 21:32                                                       ` Robert Olsson
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-02 19:06 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

On Thu, Jul 02, 2009 at 05:31:58PM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  > Yes, it looks like we can't free memory so simple because of such huge
>  > latencies.  
> 
>  Controlling RCU seems crucial. Insertion of the full BGP table increased
>  from 2 seconds to > 20 min with one synchronize_rcu patches.

I wish I had known this a few days earlier. I could imagine a slowdown,
but it looked like it was stuck. Since these last changes weren't
tested on SMP + PREEMPT I thought something was still broken.
(I was mainly interested in this synchronize_rcu at the moment as
a preemption test.)

>  And fib_trie "worst case" wrt memory is the root node. So maybe we should 
>  monitor changes in root node and use this to control synchronize_rcu.
> 
>  Didn't Paul suggest something like this?

Sure, and it needs testing, but we should send some safe preemption
fix for -stable first, shouldn't we?

>  And with don't find any decent solution we have to add an option for 
>  a fixed and pre-allocated root-nod typically for BGP-routers.

Probably you're right; I'd prefer to see the test results showing
a difference vs. simply less aggressive root thresholds. But of
course, even if not convinced, I'll respect your choice as the author
and maintainer, so feel free to NAK my proposals - I won't take it
personally.;-)

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02 19:06                                                     ` Jarek Poplawski
@ 2009-07-02 21:32                                                       ` Robert Olsson
  2009-07-02 22:13                                                         ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-07-02 21:32 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, Paweł Staszewski,
	Linux Network Development list, Robert Olsson


Jarek Poplawski writes:

 > >  Controlling RCU seems crucial. Insertion of the full BGP table increased
 > >  from 2 seconds to > 20 min with one synchronize_rcu patches.
 > 
 > I wish I knew this a few days before. I could imagine a slow down,
 > but it looked like it was stuck. Since these last changes weren't
 > tested on SMP + PREEMPT I thought there is still something broken.
 > (I was mainly interested in this synchronize_rcu at the moment as
 > a preemption test.)  


 Honestly, this huge slowdown was a surprise to me too. I think I sent
 you a script so you could insert the full table yourself.

 > >  And fib_trie "worst case" wrt memory is the root node. So maybe we should 
 > >  monitor changes in root node and use this to control synchronize_rcu.
 > > 
 > >  Didn't Paul suggest something like this?
 > 
 > Sure, and it needs testing, but we should send some safe preemption
 > fix for -stable first, don't we?
 
 Yes, my hope was that we could combine them... personally I need
 to understand better how we can be preempted in the different configs,
 and most of all whether this can be handled by "standard" RCU.

 > >  And with don't find any decent solution we have to add an option for 
 > >  a fixed and pre-allocated root-nod typically for BGP-routers.
 > 
 > Probably you're right; I'd prefer to see the test results showing
 > a difference vs. simply less aggressive root thresholds. But of
 > course, even if not convinced, I'll respect your choice as the author
 > and maintainer, so feel free to NAK my proposals - I won't get it
 > personally.;-)

 Thresholds we can change, no problem... but very soon, I think, people
 will start routing without the route cache, at least close to the
 Internet core, so we will need all the fib lookup performance we can get.

 fib_trie was designed for classical RCU and no preempt, as you can see
 from the names in the file... so this is new and very challenging work
 for all of us.
 
 First week of vacation and have to fix the roof of the house...
 it's hot and dirty. 

 Cheers.
					--ro

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02 21:32                                                       ` Robert Olsson
@ 2009-07-02 22:13                                                         ` Jarek Poplawski
  0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-02 22:13 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

On Thu, Jul 02, 2009 at 11:32:26PM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
>  > >  Controlling RCU seems crucial. Insertion of the full BGP table increased
>  > >  from 2 seconds to > 20 min with one synchronize_rcu patches.
>  > 
>  > I wish I knew this a few days before. I could imagine a slow down,
>  > but it looked like it was stuck. Since these last changes weren't
>  > tested on SMP + PREEMPT I thought there is still something broken.
>  > (I was mainly interested in this synchronize_rcu at the moment as
>  > a preemption test.)  
> 
> 
>  Honestly this huge slowdown was surprise for me too. I think I sent 
>  you a script so you could insert the full table yourself.

I can't remember this script, but I guess my hardware should be
suitable for reading it.;-)

> 
>  > >  And fib_trie "worst case" wrt memory is the root node. So maybe we should 
>  > >  monitor changes in root node and use this to control synchronize_rcu.
>  > > 
>  > >  Didn't Paul suggest something like this?
>  > 
>  > Sure, and it needs testing, but we should send some safe preemption
>  > fix for -stable first, don't we?
>  
>  Yes my hope was that we could combine them... personally I'll need 
>  to understand who we can preeemted better in the different configs
>  and most of that this can be handled by "standard" RCU.
> 
>  > >  And with don't find any decent solution we have to add an option for 
>  > >  a fixed and pre-allocated root-nod typically for BGP-routers.
>  > 
>  > Probably you're right; I'd prefer to see the test results showing
>  > a difference vs. simply less aggressive root thresholds. But of
>  > course, even if not convinced, I'll respect your choice as the author
>  > and maintainer, so feel free to NAK my proposals - I won't get it
>  > personally.;-)
> 
>  Thresholds we can change no problem... but very soon I'll people 
>  will start routing without the route cache this at least in close
>  to Internet core ,we will need all fib_look performance we can get.

I mean changing thresholds as a temporary solution, until we can
control memory freeing; and it seems to me, even excluding the root
node, there could be a lot of temporary allocations during all those
cycles repeated 10 times.

> 
>  fib_trie was designed for classical RCU and no preempt you see the
>  names i file... so this new and very challenging work to all of us.

Then it should depend on CONFIG_PREEMPT_NONE, I guess.

>  
>  First week of vacation and have to fix the roof of the house...
>  it's hot and dirty. 

Have a nice time,
Jarek P.
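
For readers following the threshold discussion, here is a rough sketch of how the root thresholds changed by the patch (8/15 before, 15/25 after) would affect a resize decision. It assumes the checks in resize() have the form "inflate while 50 * (full + non-empty) >= threshold * slots" and "halve while 100 * non-empty < threshold * slots"; that form is not shown in this thread, so treat it as an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Assumed inflate test: child_len is the number of slots (2^bits),
 * empty is the count of NULL slots, full is the count of full children.
 */
static bool demo_should_inflate(unsigned long child_len, unsigned long empty,
				unsigned long full, int threshold)
{
	assert(empty <= child_len);
	if (full == 0)
		return false;
	return 50 * (full + child_len - empty) >=
	       (unsigned long)threshold * child_len;
}

/* Assumed halve test: shrink only when too few slots are in use. */
static bool demo_should_halve(unsigned int bits, unsigned long child_len,
			      unsigned long empty, int threshold)
{
	assert(empty <= child_len);
	if (bits <= 1)
		return false;
	return 100 * (child_len - empty) < (unsigned long)threshold * child_len;
}
```

Under that assumption, an 11-bit root (2048 slots) with 1300 empty slots and 100 full children would still be inflated with inflate_threshold_root = 15 but not with 25, which is the "less aggressive" behaviour discussed above.
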

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-02  6:00                                                 ` Jarek Poplawski
  2009-07-02 15:31                                                   ` Robert Olsson
@ 2009-07-05  0:26                                                   ` Paweł Staszewski
  2009-07-05  0:30                                                     ` Paweł Staszewski
                                                                       ` (3 more replies)
  1 sibling, 4 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-05  0:26 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote:
>   
>> Jarek Poplawski wrote:
>>     
>>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
>>>   
>>>       
>>>> Jarek Poplawski wrote:
>>>>     
>>>>         
>>> ...
>>>   
>>>       
>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>> settings, but it would be nice if you could give it a try with
>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>> bit slower.
>>>>>
>>>>>   
OK, the kernel is now configured with CONFIG_PREEMPT
and has worked all day without any problems (with Jarek's last patch).


The fib_triestat output is in the attached file.
I don't see any big change (in CPU load, faster/slower routing,
propagation of routes from bgpd, or anything else). On average there
is 2% to 3% more CPU load; I don't know why, but it is there. I switched
between "preempt" and "no preempt" 3 times and checked this with
"mpstat -P ALL 1 30"; the average CPU load with "preempt" was always
2% to 3% higher than with "no preempt".

Regards
Paweł Staszewski

 
>>>>>       
>>>>>           
>>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
>>>> And working :)
>>>>     
>>>>         
>>> Hmm... It should, because you tested very similar patch already;-)
>>> Sorry if I didn't make it clear.
>>>
>>>   
>>>       
>> Yes i know there was almost identical one.
>> And i see this was without sync rcu :)
>>     
>
> Yes, it looks like we can't free memory so simple because of such huge
> latencies.  
>
>   
>>>> fib_triestats in attached file
>>>>
>>>> I think I can test it with PREEMPT enabled but first i must make some 
>>>>  other tests of my apps that are on server.
>>>>     
>>>>         
>>> It could probably matter only if you're using some broken out-of-tree
>>> patches. Otherwise the kernel is expected to work OK.
>>>
>>>   
>>>       
>> Im a little confused about using of PREEMPT kernel because of past
>> there was many oopses / lockups :) but yes that was a little long time ago.
>> I will try to make this test today.
>>
>>     
>>> Btw., it would be also interesting to check if there is any difference
>>> wrt. these route cache problems while PREEMPT is enabled.
>>>       
>
> And you're very right! The place we're fixing is the best example. On
> the other hand, I hope there is not many such places yet. But if we
> test/fix it there will be one less...
>
> Jarek P.
>
>
>   


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05  0:26                                                   ` Paweł Staszewski
@ 2009-07-05  0:30                                                     ` Paweł Staszewski
  2009-07-05 16:20                                                       ` Jarek Poplawski
  2009-07-05  0:31                                                     ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
                                                                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-05  0:30 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

Oh

I forgot: Jarek, please give me the patch with sync rcu and I will test
it on the preempt kernel.

Thanks
Paweł Staszewski

Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote:
>>  
>>> Jarek Poplawski wrote:
>>>    
>>>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
>>>>        
>>>>> Jarek Poplawski wrote:
>>>>>             
>>>> ...
>>>>        
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>   
> OK, the kernel is now configured with CONFIG_PREEMPT and has been
> working all day without any problems (with Jarek's last patch).
>
>
> The fib_triestat output is in the attached file.
> I don't see any big change (in CPU load, in faster/slower routing or
> propagation of routes from bgpd, or in anything else). On average there
> is 2% to 3% more CPU load; I don't know why, but it is there. I switched
> between "preempt" and "no preempt" 3 times and checked each time with
> "mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
> with "preempt" than with "no preempt".
>
> Regards
> Paweł Staszewski
>
>
>>>>>>                 
>>>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
>>>>> And working :)
>>>>>             
>>>> Hmm... It should, because you tested very similar patch already;-)
>>>> Sorry if I didn't make it clear.
>>>>
>>>>         
>>> Yes i know there was almost identical one.
>>> And i see this was without sync rcu :)
>>>     
>>
>> Yes, it looks like we can't free memory so simple because of such huge
>> latencies. 
>>  
>>>>> fib_triestats in attached file
>>>>>
>>>>> I think I can test it with PREEMPT enabled but first i must make 
>>>>> some  other tests of my apps that are on server.
>>>>>             
>>>> It could probably matter only if you're using some broken out-of-tree
>>>> patches. Otherwise the kernel is expected to work OK.
>>>>
>>>>         
>>> Im a little confused about using of PREEMPT kernel because of past
>>> there was many oopses / lockups :) but yes that was a little long 
>>> time ago.
>>> I will try to make this test today.
>>>
>>>    
>>>> Btw., it would be also interesting to check if there is any difference
>>>> wrt. these route cache problems while PREEMPT is enabled.
>>>>       
>>
>> And you're very right! The place we're fixing is the best example. On
>> the other hand, I hope there is not many such places yet. But if we
>> test/fix it there will be one less...
>>
>> Jarek P.
>>
>>
>>   
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05  0:26                                                   ` Paweł Staszewski
  2009-07-05  0:30                                                     ` Paweł Staszewski
@ 2009-07-05  0:31                                                     ` Paweł Staszewski
  2009-07-05 12:56                                                     ` [PATCH -stable] " Jarek Poplawski
  2009-07-05 13:08                                                     ` [PATCH v2 " Jarek Poplawski
  3 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-05  0:31 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Linux Network Development list, Robert Olsson

[-- Attachment #1: Type: text/plain, Size: 2866 bytes --]

Sorry, I forgot the attachment again.




Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Thu, Jul 02, 2009 at 07:43:25AM +0200, Paweł Staszewski wrote:
>>  
>>> Jarek Poplawski wrote:
>>>    
>>>> On Thu, Jul 02, 2009 at 12:17:19AM +0200, Paweł Staszewski wrote:
>>>>        
>>>>> Jarek Poplawski wrote:
>>>>>             
>>>> ...
>>>>        
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>   
> OK, the kernel is now configured with CONFIG_PREEMPT and has been
> working all day without any problems (with Jarek's last patch).
>
>
> The fib_triestat output is in the attached file.
> I don't see any big change (in CPU load, in faster/slower routing or
> propagation of routes from bgpd, or in anything else). On average there
> is 2% to 3% more CPU load; I don't know why, but it is there. I switched
> between "preempt" and "no preempt" 3 times and checked each time with
> "mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
> with "preempt" than with "no preempt".
>
> Regards
> Paweł Staszewski
>
>
>>>>>>                 
>>>>> Patch applied to 2.6.29.5 with CONFIG_PREEMPT_NONE
>>>>> And working :)
>>>>>             
>>>> Hmm... It should, because you tested very similar patch already;-)
>>>> Sorry if I didn't make it clear.
>>>>
>>>>         
>>> Yes i know there was almost identical one.
>>> And i see this was without sync rcu :)
>>>     
>>
>> Yes, it looks like we can't free memory so simple because of such huge
>> latencies. 
>>  
>>>>> fib_triestats in attached file
>>>>>
>>>>> I think I can test it with PREEMPT enabled but first i must make 
>>>>> some  other tests of my apps that are on server.
>>>>>             
>>>> It could probably matter only if you're using some broken out-of-tree
>>>> patches. Otherwise the kernel is expected to work OK.
>>>>
>>>>         
>>> Im a little confused about using of PREEMPT kernel because of past
>>> there was many oopses / lockups :) but yes that was a little long 
>>> time ago.
>>> I will try to make this test today.
>>>
>>>    
>>>> Btw., it would be also interesting to check if there is any difference
>>>> wrt. these route cache problems while PREEMPT is enabled.
>>>>       
>>
>> And you're very right! The place we're fixing is the best example. On
>> the other hand, I hope there is not many such places yet. But if we
>> test/fix it there will be one less...
>>
>> Jarek P.
>>
>>
>>   
>
>
>


[-- Attachment #2: fib_triestats.txt --]
[-- Type: text/plain, Size: 929 bytes --]

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277814
        Prefixes:       291306
        Internal nodes: 66420
          1: 32737  2: 14850  3: 10332  4: 4871  5: 2313  6: 942  7: 371  8: 3  17: 1
        Pointers: 599098
Null ptrs: 254865
Total size: 18067  kB

Counters:
---------
gets = 2003686
backtracks = 78789
semantic match passed = 1977687
semantic match miss = 112
null node hit= 1470619
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 2008497
backtracks = 1417179
semantic match passed = 4823
semantic match miss = 0
null node hit= 197044
skipped node resize = 0
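
The counters in this dump can be cross-checked with a little arithmetic.
The sketch below is not from the thread. It assumes that each "N: count"
pair on the "Internal nodes" line means `count` tnodes with 2^N child
slots, and that every slot is either null, a leaf, or a non-root tnode.
The struct and helper names are hypothetical. Both assumptions agree
with the Main numbers above.

```c
#include <assert.h>
#include <stddef.h>

/* One "bits: count" pair from the "Internal nodes" histogram line. */
struct tnode_bucket {
	unsigned int bits;    /* log2 of the number of child slots */
	unsigned long count;  /* how many tnodes have that size */
};

/* Main table histogram copied from the attached fib_triestats.txt. */
const struct tnode_bucket main_hist[] = {
	{ 1, 32737 }, { 2, 14850 }, { 3, 10332 }, { 4, 4871 }, { 5, 2313 },
	{ 6, 942 },   { 7, 371 },   { 8, 3 },     { 17, 1 },
};
#define MAIN_HIST_LEN (sizeof(main_hist) / sizeof(main_hist[0]))

/* "Internal nodes" is the sum of the per-size counts. */
unsigned long total_tnodes(const struct tnode_bucket *h, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += h[i].count;
	return sum;
}

/* "Pointers": a tnode of `bits` bits owns 2^bits child slots. */
unsigned long total_pointers(const struct tnode_bucket *h, size_t n)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < n; i++) {
		assert(h[i].bits < 8 * sizeof(unsigned long));
		sum += h[i].count << h[i].bits;
	}
	return sum;
}

/*
 * "Null ptrs" under the assumption above: slots that hold neither a
 * leaf nor a tnode. The root is the only tnode without a parent slot.
 */
unsigned long null_pointers(unsigned long pointers, unsigned long leaves,
			    unsigned long tnodes)
{
	assert(tnodes >= 1 && pointers >= leaves + tnodes - 1);
	return pointers - leaves - (tnodes - 1);
}
```

With these inputs the helpers give 66420 internal nodes, 599098 pointers
and, for 277814 leaves, 254865 null pointers, matching the dump. The
single 17-bit root alone accounts for 131072 of the pointers; in the
original report the root had 18 bits.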





^ permalink raw reply	[flat|nested] 99+ messages in thread

* [PATCH -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05  0:26                                                   ` Paweł Staszewski
  2009-07-05  0:30                                                     ` Paweł Staszewski
  2009-07-05  0:31                                                     ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
@ 2009-07-05 12:56                                                     ` Jarek Poplawski
  2009-07-05 13:08                                                     ` [PATCH v2 " Jarek Poplawski
  3 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 12:56 UTC (permalink / raw)
  To: David Miller
  Cc: Paweł Staszewski, Linux Network Development list,
	Robert Olsson, Jorge Boncompte [DTI2]

David & Robert,
below are my recommendations for -stable plus one more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
...
>>>>> Jarek Poplawski wrote:
...
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>   
> OK, the kernel is now configured with CONFIG_PREEMPT and has been
> working all day without any problems (with Jarek's last patch).
>
>
> The fib_triestat output is in the attached file.
> I don't see any big change (in CPU load, in faster/slower routing or
> propagation of routes from bgpd, or in anything else). On average there
> is 2% to 3% more CPU load; I don't know why, but it is there. I switched
> between "preempt" and "no preempt" 3 times and checked each time with
> "mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
> with "preempt" than with "no preempt".
>
> Regards
> Paweł Staszewski

So, now that these patches from net-2.6 have been tested with both
PREEMPT and PREEMPT_NONE, I think they should go to -stable:

2.6.30 needs:
-------------

commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Tue Jun 30 12:47:19 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below

    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------

this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Thu May 21 15:20:59 2009 -0700

    ipv4: Fix oops with FIB_TRIE

plus the above-mentioned patches for 2.6.30 (parts 1 - 4)

-----------------

David, if possible, please add to all these "Fix... part 1 - 4":

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

The new patch below is intended only for -stable (and later for
net-next), because it doesn't meet the rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.28
        Max depth:      6
        Leaves:         276539
        Prefixes:       289922
        Internal nodes: 66762
          1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5  
9: 1  18: 1
        Pointers: 691228
Null ptrs: 347928
Total size: 35709  kB
</blockquote>

It seems the current threshold for root resizing is too aggressive: it
causes misleading warnings during big updates, and it might also be
responsible for memory problems, especially with non-preempt configs,
where RCU freeing is delayed long after call_rcu.

It should also be mentioned that, because of non-atomic changes during
resizing/rebalancing, the current lookup algorithm can miss valid
leaves, so this is an additional argument for shortening these
activities, even at the cost of minimally longer searching.

This patch restores the values from before the patch "[IPV4]: fib_trie
root node settings", commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
from v2.6.22.

Pawel's report:
<blockquote>
I don't see any big change (in CPU load, in faster/slower routing or
propagation of routes from bgpd, or in anything else). On average there
is 2% to 3% more CPU load; I don't know why, but it is there. I switched
between "preempt" and "no preempt" 3 times and checked each time with
"mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
with "preempt" than with "no preempt".
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277814
        Prefixes:       291306
        Internal nodes: 66420
          1: 32737  2: 14850  3: 10332  4: 4871  5: 2313  6: 942  7: 371  8: 3  17: 1
        Pointers: 599098
Null ptrs: 254865
Total size: 18067  kB
</blockquote>

According to this and other similar reports, the average depth is
slightly increased (~0.2) and the root node is smaller (log2 size 17
vs. 18), but there is no visible performance decrease. So, until memory
handling is improved or parameters are added for changing this
individually, this patch resets to safer defaults.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
---

 net/ipv4/fib_trie.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..63c2fa7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -316,8 +316,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
 
 
 static void __alias_free_mem(struct rcu_head *head)

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05  0:26                                                   ` Paweł Staszewski
                                                                       ` (2 preceding siblings ...)
  2009-07-05 12:56                                                     ` [PATCH -stable] " Jarek Poplawski
@ 2009-07-05 13:08                                                     ` Jarek Poplawski
  2009-07-08  2:42                                                       ` David Miller
  3 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 13:08 UTC (permalink / raw)
  To: David Miller
  Cc: Paweł Staszewski, Linux Network Development list,
	Robert Olsson, Jorge Boncompte [DTI2]

(Take 2: Changelog spelling fixes, sorry.)

David & Robert,
below are my recommendations for -stable plus one more patch:

On Sun, Jul 05, 2009 at 02:26:54AM +0200, Paweł Staszewski wrote:
...
>>>>> Jarek Poplawski wrote:
...
>>>>>> So, after your findings I'm about to recommend sending to -stable
>>>>>> 3 patches from net-2.6, with additional lowering of threshold_root
>>>>>> settings, but it would be nice if you could give it a try with
>>>>>> CONFIG_PREEMPT instead of CONFIG_PREEMPT_NONE (if it doesn't break
>>>>>> your other apps!) It is expected to work this time...;-) Maybe a
>>>>>> bit slower.
>>>>>>
>>>>>>   
> OK, the kernel is now configured with CONFIG_PREEMPT and has been
> working all day without any problems (with Jarek's last patch).
>
>
> The fib_triestat output is in the attached file.
> I don't see any big change (in CPU load, in faster/slower routing or
> propagation of routes from bgpd, or in anything else). On average there
> is 2% to 3% more CPU load; I don't know why, but it is there. I switched
> between "preempt" and "no preempt" 3 times and checked each time with
> "mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
> with "preempt" than with "no preempt".
>
> Regards
> Paweł Staszewski

So, now that these patches from net-2.6 have been tested with both
PREEMPT and PREEMPT_NONE, I think they should go to -stable:

2.6.30 needs:
-------------

commit e0f7cb8c8cc6cccce28d2ce39ad8c60d23c3799f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Mon Jun 15 02:31:29 2009 -0700

    ipv4: Fix fib_trie rebalancing

commit 7b85576d15bf2574b0a451108f59f9ad4170dd3f
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Thu Jun 18 00:28:51 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 2

commit 008440e3ad4b72f5048d1b1f6f5ed894fdc5ad08
Author: Jarek Poplawski <jarkao2@gmail.com>
Date:   Tue Jun 30 12:47:19 2009 -0700

    ipv4: Fix fib_trie rebalancing, part 3

plus the new patch below

    ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

2.6.29 needs:
-------------

this patch from 2.6.30:
commit 3ed18d76d959e5cbfa5d70c8f7ba95476582a556
Author: Robert Olsson <robert.olsson@its.uu.se>
Date:   Thu May 21 15:20:59 2009 -0700

    ipv4: Fix oops with FIB_TRIE

plus the above-mentioned patches for 2.6.30 (parts 1 - 4)

-----------------

David, if possible, please add to all these "Fix... part 1 - 4":

Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>

The new patch below is intended only for -stable (and later for
net-next), because it doesn't meet the rules of the current -rc. Anyway,
it's not critical (but it actually fixes a regression from 2.6.22).

Thanks,
Jarek P.
---------------->
ipv4: Fix fib_trie rebalancing, part 4 (root thresholds)

Pawel Staszewski wrote:
<blockquote>
Some time ago i report this:
http://bugzilla.kernel.org/show_bug.cgi?id=6648

and now with 2.6.29 / 2.6.29.1 / 2.6.29.3 and 2.6.30 it back
dmesg output:
oprofile: using NMI interrupt.
Fix inflate_threshold_root. Now=15 size=11 bits
...
Fix inflate_threshold_root. Now=15 size=11 bits

cat /proc/net/fib_triestat
Basic info: size of leaf: 40 bytes, size of tnode: 56 bytes.
Main:
        Aver depth:     2.28
        Max depth:      6
        Leaves:         276539
        Prefixes:       289922
        Internal nodes: 66762
          1: 35046  2: 13824  3: 9508  4: 4897  5: 2331  6: 1149  7: 5  
9: 1  18: 1
        Pointers: 691228
Null ptrs: 347928
Total size: 35709  kB
</blockquote>

It seems the current threshold for root resizing is too aggressive: it
causes misleading warnings during big updates, and it might also be
responsible for memory problems, especially with non-preempt configs,
where RCU freeing is delayed long after call_rcu.

It should also be mentioned that, because of non-atomic changes during
resizing/rebalancing, the current lookup algorithm can miss valid
leaves, so this is an additional argument for shortening these
activities, even at the cost of minimally longer searching.

This patch restores the values from before the patch "[IPV4]: fib_trie
root node settings", commit 965ffea43d4ebe8cd7b9fee78d651268dd7d23c5
from v2.6.22.

Pawel's report:
<blockquote>
I don't see any big change (in CPU load, in faster/slower routing or
propagation of routes from bgpd, or in anything else). On average there
is 2% to 3% more CPU load; I don't know why, but it is there. I switched
between "preempt" and "no preempt" 3 times and checked each time with
"mpstat -P ALL 1 30": the average CPU load was always 2% to 3% higher
with "preempt" than with "no preempt".
[...]
cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277814
        Prefixes:       291306
        Internal nodes: 66420
          1: 32737  2: 14850  3: 10332  4: 4871  5: 2313  6: 942  7: 371  8: 3  17: 1
        Pointers: 599098
Null ptrs: 254865
Total size: 18067  kB
</blockquote>

According to this and other similar reports, the average depth is
slightly increased (~0.2) and the root node is smaller (log2 size 17
vs. 18), but there is no visible performance decrease. So, until memory
handling is improved or parameters are added for changing this
individually, this patch resets to safer defaults.

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
---

 net/ipv4/fib_trie.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..63c2fa7 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -316,8 +316,8 @@ static inline void check_tnode(const struct tnode *tn)
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 8;
-static const int inflate_threshold_root = 15;
+static const int halve_threshold_root = 15;
+static const int inflate_threshold_root = 25;
 
 
 static void __alias_free_mem(struct rcu_head *head)
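
For readers who want to see what the four constants control, here is a
user-space sketch of the two occupancy checks. It is written from memory
of resize() in net/ipv4/fib_trie.c of this era and is not quoted
anywhere in this thread, so treat the exact formulas as an assumption;
struct tnode_stats and the function names are hypothetical. The idea is
that a node is doubled only if the doubled node would still be filled to
at least the inflate threshold (in percent), and halved only if its
current fill drops below the halve threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Occupancy counters of one internal node with 2^bits child slots. */
struct tnode_stats {
	unsigned int bits;             /* log2 of the number of child slots */
	unsigned long full_children;   /* children that are full tnodes */
	unsigned long empty_children;  /* NULL child slots */
};

unsigned long child_slots(const struct tnode_stats *tn)
{
	assert(tn->bits < 8 * sizeof(unsigned long));
	return 1UL << tn->bits;
}

/*
 * Doubling splits every full child in two, so the doubled node would
 * have (slots - empty + full) used slots out of 2 * slots. Requiring
 * that ratio to be at least threshold percent gives the test below.
 */
bool should_inflate(const struct tnode_stats *tn, int threshold)
{
	unsigned long slots = child_slots(tn);

	return tn->full_children > 0 &&
	       50 * (tn->full_children + slots - tn->empty_children) >=
	       (unsigned long)threshold * slots;
}

/* Halve when fewer than threshold percent of the slots are in use. */
bool should_halve(const struct tnode_stats *tn, int threshold)
{
	unsigned long slots = child_slots(tn);

	return tn->bits > 1 &&
	       100 * (slots - tn->empty_children) <
	       (unsigned long)threshold * slots;
}
```

As a made-up example, take an 11-bit root with 300 full and 1500 empty
children: with inflate_threshold_root = 15 the check still asks for
another doubling, while with the restored value of 25 it does not. A
higher root threshold therefore means fewer and smaller root resizes.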

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05  0:30                                                     ` Paweł Staszewski
@ 2009-07-05 16:20                                                       ` Jarek Poplawski
  2009-07-05 17:32                                                         ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 16:20 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Linux Network Development list, Robert Olsson, Paul E. McKenney

On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> Oh
>
> I forgot: Jarek, please give me the patch with sync rcu and I will test
> it on the preempt kernel.

A non-preempt kernel probably needs something like this more, but a
comparison is always interesting. This patch is based on Paul's
suggestion (I hope).

Thanks,
Jarek P.
---> (synchronize take 7; apply on top of the 2.6.29.x with the last
	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..fce8238 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,7 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +394,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +407,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * 128) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 16:20                                                       ` Jarek Poplawski
@ 2009-07-05 17:32                                                         ` Jarek Poplawski
  2009-07-05 21:32                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 17:32 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Linux Network Development list, Robert Olsson, Paul E. McKenney

On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
> On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> > Oh
> >
> > I forgot: Jarek, please give me the patch with sync rcu and I will test
> > it on the preempt kernel.
> 
> A non-preempt kernel probably needs something like this more, but a
> comparison is always interesting. This patch is based on Paul's
> suggestion (I hope).

Hold on ;-) Here is something even better... Syncing after 128 pages
might still be too slow, so here is a higher initial value, 1000, plus
you can change this while testing in:

/sys/module/fib_trie/parameters/sync_pages

It would be interesting to find the lowest acceptable value.

Jarek P.
---> (synchronize take 8; apply on top of the 2.6.29.x with the last
 	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..decc8d0 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +411,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)
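
To get a feel for the 128-page and 1000-page limits, here is a
user-space model of the accounting this patch adds. It is only a sketch:
struct free_account and the helper names are made up, the 36-byte tnode
header comes from the fib_triestat dump earlier in the thread, and the
4-byte pointer and 4096-byte page are assumptions about a 32-bit box.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes accounted for one queued tnode: header plus 2^bits pointers. */
size_t tnode_bytes(size_t header_size, size_t ptr_size, unsigned int bits)
{
	assert(bits < 8 * sizeof(size_t));
	return header_size + (ptr_size << bits);
}

/* Model of the state kept by the patch. */
struct free_account {
	size_t pending;       /* plays the role of tnode_free_size */
	size_t page_size;     /* stand-in for PAGE_SIZE */
	size_t sync_pages;    /* stand-in for the sync_pages parameter */
	unsigned long syncs;  /* times synchronize_rcu() would have run */
};

/* Mirrors the addition in tnode_free_safe(): only bump the counter. */
void account_free(struct free_account *a, size_t bytes)
{
	a->pending += bytes;
}

/*
 * Mirrors the addition in tnode_free_flush(): once enough bytes have
 * been queued, reset the counter and wait for a grace period. Returns
 * true when the model would have called synchronize_rcu().
 */
bool flush(struct free_account *a)
{
	if (a->pending >= a->page_size * a->sync_pages) {
		a->pending = 0;
		a->syncs++;
		return true;
	}
	return false;
}
```

Under these assumptions one freed 17-bit root is 524324 bytes, which
already exceeds 128 pages (524288 bytes), so take 7 would synchronize on
every root resize. With sync_pages = 1000 (4096000 bytes) it takes eight
such roots before the model synchronizes.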

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 17:32                                                         ` Jarek Poplawski
@ 2009-07-05 21:32                                                           ` Paul E. McKenney
  2009-07-05 22:23                                                             ` Jarek Poplawski
                                                                               ` (2 more replies)
  0 siblings, 3 replies; 99+ messages in thread
From: Paul E. McKenney @ 2009-07-05 21:32 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote:
> On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
> > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> > > Oh
> > >
> > > I forgot: Jarek, please give me the patch with sync rcu and I will test
> > > it on the preempt kernel.
> > 
> > A non-preempt kernel probably needs something like this more, but a
> > comparison is always interesting. This patch is based on Paul's
> > suggestion (I hope).
> 
> Hold on ;-) Here is something even better... Syncing after 128 pages
> might still be too slow, so here is a higher initial value, 1000, plus
> you can change this while testing in:
> 
> /sys/module/fib_trie/parameters/sync_pages
> 
> It would be interesting to find the lowest acceptable value.

Looks like a promising approach to me!

							Thanx, Paul

> Jarek P.
> ---> (synchronize take 8; apply on top of the 2.6.29.x with the last
>  	all-in-one patch, or net-2.6)
> 
>  net/ipv4/fib_trie.c |   12 ++++++++++++
>  1 files changed, 12 insertions(+), 0 deletions(-)
> 
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 00a54b2..decc8d0 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -71,6 +71,7 @@
>  #include <linux/netlink.h>
>  #include <linux/init.h>
>  #include <linux/list.h>
> +#include <linux/moduleparam.h>
>  #include <net/net_namespace.h>
>  #include <net/ip.h>
>  #include <net/protocol.h>
> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>  /* tnodes to free after resize(); protected by RTNL */
>  static struct tnode *tnode_free_head;
> +static size_t tnode_free_size;
> +
> +static int sync_pages __read_mostly = 1000;
> +module_param(sync_pages, int, 0640);
> 
>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
>  	BUG_ON(IS_LEAF(tn));
>  	tn->tnode_free = tnode_free_head;
>  	tnode_free_head = tn;
> +	tnode_free_size += sizeof(struct tnode) +
> +			   (sizeof(struct node *) << tn->bits);
>  }
> 
>  static void tnode_free_flush(void)
> @@ -404,6 +411,11 @@ static void tnode_free_flush(void)
>  		tn->tnode_free = NULL;
>  		tnode_free(tn);
>  	}
> +
> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> +		tnode_free_size = 0;
> +		synchronize_rcu();
> +	}
>  }
> 
>  static struct leaf *leaf_new(void)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 21:32                                                           ` Paul E. McKenney
@ 2009-07-05 22:23                                                             ` Jarek Poplawski
  2009-07-05 23:53                                                               ` Paweł Staszewski
  2009-07-14 18:33                                                             ` [PATCH net-next] " Jarek Poplawski
  2009-07-14 21:20                                                             ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
  2 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-05 22:23 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Paweł Staszewski, Linux Network Development list, Robert Olsson

On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote:
> > On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
> > > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> > > > Oh
> > > >
> > > > I forgot - please Jarek give me patch with sync rcu and i will make test  
> > > > on preempt kernel
> > > 
> > > Probably non-preempt kernel might need something like this more, but
> > > comparing is always interesting. This patch is based on Paul's
> > > suggestion (I hope).
> > 
> > Hold on ;-) Here is something even better... Syncing after 128 pages
> > might be still too slow, so here is a higher initial value, 1000, plus
> > you can change this while testing in:
> > 
> > /sys/module/fib_trie/parameters/sync_pages
> > 
> > It would be interesting to find the lowest acceptable value.
> 
> Looks like a promising approach to me!
> 
> 							Thanx, Paul

Hmm... As a matter of fact, I'm a bit sceptical now: I'm worried that
this synchronize_rcu(), done at the lowest acceptable rate, could
actually be mostly idle or, on the contrary, come too late. Probably
some more complex (per cpu?) accounting would be necessary to really
matter here, but on the other hand these problems weren't reported
often enough.
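
To make the rate question concrete, the accounting added by the quoted
take 8 patch can be modelled in user space. This is only a sketch: the
struct and function names are hypothetical, and the sizes are passed in
as plain numbers because the real sizeof() values depend on the kernel
build.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of the counters touched by the patch. */
struct free_acct {
	size_t pending_bytes;	/* models tnode_free_size */
	size_t page_size;	/* models PAGE_SIZE */
	size_t sync_pages;	/* models the sync_pages module parameter */
	size_t tnode_size;	/* models sizeof(struct tnode) */
	size_t ptr_size;	/* models sizeof(struct node *) */
};

/* Mirrors the line added to tnode_free_safe():
 * node header plus 2^bits child pointers. */
void acct_queue_tnode(struct free_acct *a, unsigned int bits)
{
	a->pending_bytes += a->tnode_size + (a->ptr_size << bits);
}

/* Mirrors the check added to tnode_free_flush(): returns true where the
 * patch would call synchronize_rcu(), and resets the counter only then. */
bool acct_flush(struct free_acct *a)
{
	if (a->pending_bytes >= a->page_size * a->sync_pages) {
		a->pending_bytes = 0;
		return true;
	}
	return false;
}
```

For example, assuming a 36-byte tnode, 4-byte pointers and 4096-byte
pages, queueing one 17-bit node adds 36 + (4 << 17) = 524324 bytes,
which alone exceeds 128 pages (524288 bytes), while many small nodes
are needed to reach the same threshold.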

Thanks,
Jarek P.

> > ---> (synchronize take 8; apply on top of the 2.6.29.x with the last
> >  	all-in-one patch, or net-2.6)
> > 
> >  net/ipv4/fib_trie.c |   12 ++++++++++++
> >  1 files changed, 12 insertions(+), 0 deletions(-)
> > 
> > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > index 00a54b2..decc8d0 100644
> > --- a/net/ipv4/fib_trie.c
> > +++ b/net/ipv4/fib_trie.c
> > @@ -71,6 +71,7 @@
> >  #include <linux/netlink.h>
> >  #include <linux/init.h>
> >  #include <linux/list.h>
> > +#include <linux/moduleparam.h>
> >  #include <net/net_namespace.h>
> >  #include <net/ip.h>
> >  #include <net/protocol.h>
> > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
> >  static struct tnode *halve(struct trie *t, struct tnode *tn);
> >  /* tnodes to free after resize(); protected by RTNL */
> >  static struct tnode *tnode_free_head;
> > +static size_t tnode_free_size;
> > +
> > +static int sync_pages __read_mostly = 1000;
> > +module_param(sync_pages, int, 0640);
> > 
> >  static struct kmem_cache *fn_alias_kmem __read_mostly;
> >  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> > @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
> >  	BUG_ON(IS_LEAF(tn));
> >  	tn->tnode_free = tnode_free_head;
> >  	tnode_free_head = tn;
> > +	tnode_free_size += sizeof(struct tnode) +
> > +			   (sizeof(struct node *) << tn->bits);
> >  }
> > 
> >  static void tnode_free_flush(void)
> > @@ -404,6 +411,11 @@ static void tnode_free_flush(void)
> >  		tn->tnode_free = NULL;
> >  		tnode_free(tn);
> >  	}
> > +
> > +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> > +		tnode_free_size = 0;
> > +		synchronize_rcu();
> > +	}
> >  }
> > 
> >  static struct leaf *leaf_new(void)

^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 22:23                                                             ` Jarek Poplawski
@ 2009-07-05 23:53                                                               ` Paweł Staszewski
  2009-07-06  9:02                                                                 ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-05 23:53 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

kernel 2.6.29.5 preempt
BGP starts normally and the kernel learns the routes normally, the same
as without the patch.

Here is some fib_triestat output:

cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277888
        Prefixes:       291399
        Internal nodes: 66818
          1: 33080  2: 14584  3: 10788  4: 4911  5: 2185  6: 900  7: 366  8: 3  17: 1
        Pointers: 595584
Null ptrs: 250879
Total size: 18072  kB

Counters:
---------
gets = 1052940
backtracks = 55985
semantic match passed = 1034114
semantic match miss = 5
null node hit= 534415
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 1057636
backtracks = 1101307
semantic match passed = 4751
semantic match miss = 0
null node hit= 195605
skipped node resize = 0
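
A quick consistency check on the tables above: the "Pointers" figure
should equal the sum, over the internal-node histogram, of count times
2^bits. The helper below is a sketch with a hypothetical name; it only
restates that arithmetic.

```c
#include <stddef.h>

/* counts[i] is the number of internal nodes that use bits[i] index bits.
 * Each such node has 2^bits[i] child slots. */
unsigned long total_child_slots(const unsigned int *bits,
				const unsigned long *counts, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += counts[i] << bits[i];
	return total;
}
```

For the Local table (1: 9, 2: 1) this gives 9*2 + 1*4 = 22, and for the
Main table it gives 595584, both matching the reported Pointers values.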




kernel 2.6.29.5 no-preempt
All is OK, as with the preempt kernel (and route propagation completes
in the normal time).

cat /sys/module/fib_trie/parameters/sync_pages
1000


cat /proc/net/fib_triestat
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.45
        Max depth:      6
        Leaves:         277905
        Prefixes:       291416
        Internal nodes: 66863
          1: 33119  2: 14594  3: 10782  4: 4911  5: 2187  6: 901  7: 365  8: 3  17: 1
        Pointers: 595654
Null ptrs: 250887
Total size: 18074  kB

Counters:
---------
gets = 1060650
backtracks = 53161
semantic match passed = 1041008
semantic match miss = 12
null node hit= 504478
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 1065517
backtracks = 1095422
semantic match passed = 4954
semantic match miss = 0
null node hit= 195584
skipped node resize = 0

So I ran tests with different sync_pages values:

####################################
sync_pages: 64
Total size reaches its maximum in 17 sec

Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.43
        Max depth:      6
        Leaves:         271928
        Prefixes:       285435
        Internal nodes: 66185
          1: 32904  2: 14554  3: 10740  4: 4677  5: 2047  6: 901  7: 361  17: 1
        Pointers: 585224
Null ptrs: 247112
Total size: 17729  kB

Counters:
---------
gets = 5313544
backtracks = 230501
semantic match passed = 5233998
semantic match miss = 61
null node hit= 2757531
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 5332471
backtracks = 4708505
semantic match passed = 19264
semantic match miss = 0
null node hit= 782757
skipped node resize = 0



######################################
sync_pages: 128
fib_trie Total size reaches its maximum in 14 sec
Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277915
        Prefixes:       291427
        Internal nodes: 66832
          1: 33085  2: 14597  3: 10785  4: 4908  5: 2187  6: 900  7: 366  8: 3  17: 1
        Pointers: 595638
Null ptrs: 250892
Total size: 18074  kB

Counters:
---------
gets = 6698058
backtracks = 307491
semantic match passed = 6593421
semantic match miss = 66
null node hit= 3498560
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 6721120
backtracks = 5934017
semantic match passed = 23440
semantic match miss = 0
null node hit= 978008
skipped node resize = 0

#########################################
sync_pages: 256
Hmm, no difference: the maximum is also reached in 10 sec

Basic info: size of leaf: 20 bytes, size of tnode: 36 bytes.
Main:
        Aver depth:     2.44
        Max depth:      6
        Leaves:         277913
        Prefixes:       291425
        Internal nodes: 66829
          1: 33082  2: 14596  3: 10786  4: 4909  5: 2186  6: 900  7: 366  8: 3  17: 1
        Pointers: 595620
Null ptrs: 250879
Total size: 18073  kB

Counters:
---------
gets = 4637474
backtracks = 188624
semantic match passed = 4577266
semantic match miss = 61
null node hit= 2451890
skipped node resize = 0

Local:
        Aver depth:     3.75
        Max depth:      5
        Leaves:         12
        Prefixes:       13
        Internal nodes: 10
          1: 9  2: 1
        Pointers: 22
Null ptrs: 1
Total size: 2  kB

Counters:
---------
gets = 4651791
backtracks = 3716400
semantic match passed = 14613
semantic match miss = 0
null node hit= 587208
skipped node resize = 0


And with sync_pages higher than 256, the time to fill the kernel routes
stays the same, approx. 10 sec.


I ran this test using:
watch -n1 cat /proc/net/fib_triestat
The timer started when Total size was 1 kB and stopped when Total size
reached 18073 kB.
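
The measurement described above could be automated by reading
/proc/net/fib_triestat in a loop and extracting the first "Total size"
value. The parser below is only a sketch: the function name is
hypothetical, and it assumes the Main table is listed first, as in the
output in this message.

```c
#include <stdio.h>
#include <string.h>

/* Returns the first "Total size: N kB" value found in text,
 * or -1 if no such line is present. */
long first_total_size_kb(const char *text)
{
	const char *p = strstr(text, "Total size:");
	long kb;

	if (p == NULL || sscanf(p, "Total size: %ld", &kb) != 1)
		return -1;
	return kb;
}
```

A polling loop would call this once per second on the file contents,
start the timer at the first small value and stop it when the value
stops growing.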


Regards
Paweł Staszewski

Jarek Poplawski wrote:
> On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote:
>   
>> On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote:
>>     
>>> On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
>>>       
>>>> On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
>>>>         
>>>>> Oh
>>>>>
>>>>> I forgot - please Jarek give me patch with sync rcu and i will make test  
>>>>> on preempt kernel
>>>>>           
>>>> Probably non-preempt kernel might need something like this more, but
>>>> comparing is always interesting. This patch is based on Paul's
>>>> suggestion (I hope).
>>>>         
>>> Hold on ;-) Here is something even better... Syncing after 128 pages
>>> might be still too slow, so here is a higher initial value, 1000, plus
>>> you can change this while testing in:
>>>
>>> /sys/module/fib_trie/parameters/sync_pages
>>>
>>> It would be interesting to find the lowest acceptable value.
>>>       
>> Looks like a promising approach to me!
>>
>> 							Thanx, Paul
>>     
>
> Hmm... As a matter of fact, I'm a bit sceptical now: I'm worrying this
> synchronize_rcu done at the lowest acceptable rate could be actually
> mostly idle or on the contrary too late. Probably some more complex
> (per cpu?) accounting would be necessary to really matter here, but
> on the other hand these problems weren't reported often enough.
>
> Thanks,
> Jarek P.
>
>   
>>> ---> (synchronize take 8; apply on top of the 2.6.29.x with the last
>>>  	all-in-one patch, or net-2.6)
>>>
>>>  net/ipv4/fib_trie.c |   12 ++++++++++++
>>>  1 files changed, 12 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>>> index 00a54b2..decc8d0 100644
>>> --- a/net/ipv4/fib_trie.c
>>> +++ b/net/ipv4/fib_trie.c
>>> @@ -71,6 +71,7 @@
>>>  #include <linux/netlink.h>
>>>  #include <linux/init.h>
>>>  #include <linux/list.h>
>>> +#include <linux/moduleparam.h>
>>>  #include <net/net_namespace.h>
>>>  #include <net/ip.h>
>>>  #include <net/protocol.h>
>>> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>>>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>>>  /* tnodes to free after resize(); protected by RTNL */
>>>  static struct tnode *tnode_free_head;
>>> +static size_t tnode_free_size;
>>> +
>>> +static int sync_pages __read_mostly = 1000;
>>> +module_param(sync_pages, int, 0640);
>>>
>>>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>>>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
>>> @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
>>>  	BUG_ON(IS_LEAF(tn));
>>>  	tn->tnode_free = tnode_free_head;
>>>  	tnode_free_head = tn;
>>> +	tnode_free_size += sizeof(struct tnode) +
>>> +			   (sizeof(struct node *) << tn->bits);
>>>  }
>>>
>>>  static void tnode_free_flush(void)
>>> @@ -404,6 +411,11 @@ static void tnode_free_flush(void)
>>>  		tn->tnode_free = NULL;
>>>  		tnode_free(tn);
>>>  	}
>>> +
>>> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
>>> +		tnode_free_size = 0;
>>> +		synchronize_rcu();
>>> +	}
>>>  }
>>>
>>>  static struct leaf *leaf_new(void)
>>>       
>
>
>   


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 23:53                                                               ` Paweł Staszewski
@ 2009-07-06  9:02                                                                 ` Jarek Poplawski
  2009-07-07 22:56                                                                   ` Paweł Staszewski
  2009-07-07 23:23                                                                   ` [PATCH net-2.6] " Paweł Staszewski
  0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-06  9:02 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
...
> So i make tests with changing sync_pages
> And
>
> ####################################
> sync_pages: 64
> total size reach maximum in 17sec
...
> ######################################
> sync_pages: 128
> Fib trie Total size reach max in 14sec
...
> #########################################
> sync_pages: 256
> hmm no difference also in 10sec

14 == 10!? ;-)
...
> And with sync_pages higher that 256 time of filling kernel routes is the  
> same approx 10sec.

Hmm... So, it's better than I expected; syncing after 128 or 256 pages
could be quite reasonable. But then it would be interesting to find
out whether, with such a safety net, we could go back to more aggressive
values for possibly better performance. So here is 'the same' patch (so
the previous one, take 8, should be reverted), but with the additional
possibility to change:
/sys/module/fib_trie/parameters/inflate_threshold_root

I guess you could check e.g. whether sync_pages 256 with
inflate_threshold_root 15 gives faster lookups (or lower CPU load); btw,
with this setting the inflate warnings could come back. Or maybe you'll
find that something in between, like inflate_threshold_root 20, is
optimal for you. (I think it should be enough to try this only for
PREEMPT_NONE, unless you have spare time ;-)
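
One side effect of the patch below is that halve_threshold_root is no
longer a separate constant but is derived from inflate_threshold_root.
The function here is a hypothetical user-space copy of that macro
expression, useful for checking what a given setting implies.

```c
/* Same expression as the halve_threshold_root macro in the patch,
 * written as a function so it can be evaluated for any setting. */
int halve_threshold_root_for(int inflate_threshold_root)
{
	return inflate_threshold_root / 2 + 1;
}
```

With integer division this gives 13 for the default of 25 (the removed
constant was 15), 11 for 20, 8 for 15 and 6 for 10.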

Thanks,
Jarek P.
---> (synchronize take 9; apply on top of the 2.6.29.x with the last
  	all-in-one patch, or net-2.6)

 net/ipv4/fib_trie.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 00a54b2..e8fca11 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -71,6 +71,7 @@
 #include <linux/netlink.h>
 #include <linux/init.h>
 #include <linux/list.h>
+#include <linux/moduleparam.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+static int sync_pages __read_mostly = 1000;
+module_param(sync_pages, int, 0640);
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)
 
 static const int halve_threshold = 25;
 static const int inflate_threshold = 50;
-static const int halve_threshold_root = 15;
-static const int inflate_threshold_root = 25;
 
+static int inflate_threshold_root __read_mostly = 25;
+module_param(inflate_threshold_root, int, 0640);
+
+#define halve_threshold_root	(inflate_threshold_root / 2 + 1)
 
 static void __alias_free_mem(struct rcu_head *head)
 {
@@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +413,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)

^ permalink raw reply related	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-06  9:02                                                                 ` Jarek Poplawski
@ 2009-07-07 22:56                                                                   ` Paweł Staszewski
  2009-07-07 23:50                                                                     ` Jarek Poplawski
  2009-07-07 23:23                                                                   ` [PATCH net-2.6] " Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-07 22:56 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
> ...
>   
>> So i make tests with changing sync_pages
>> And
>>
>> ####################################
>> sync_pages: 64
>> total size reach maximum in 17sec
>>     
> ...
>   
>> ######################################
>> sync_pages: 128
>> Fib trie Total size reach max in 14sec
>>     
> ...
>   
>> #########################################
>> sync_pages: 256
>> hmm no difference also in 10sec
>>     
>
> 14 == 10!? ;-)
> ...
>   
>> And with sync_pages higher that 256 time of filling kernel routes is the  
>> same approx 10sec.
>>     
>
> Hmm... So, it's better than I expected; syncing after 128 or 256 pages
> could be quite reasonable. But then it would be interesting to find
> out if with such a safety we could go back to more aggressive values
> for possibly better performance. So here is 'the same' patch (so the
> previous, take 8, should be reverted), but with additional possibility
> to change:
> /sys/module/fib_trie/parameters/inflate_threshold_root
>
> I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15
> can give faster lookups (or lower cpu loads); with this these inflate
> warnings could be back btw.; or maybe you'll find something in between
> like inflate_threshold_root 20 is optimal for you. (I think it should be
> enough to try this only for PREEMPT_NONE unless you have spare time ;-)
>
> Thanks,
> Jarek P.
> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>   	all-in-one patch, or net-2.6)
>
>   

Applied to 2.6.29.5 preempt/no-preempt and tested. With preempt I ran
only one test, with sync_pages = 256, to check that it works :)

So here are some tests for different sync_pages values.

echo 1 > /sys/module/fib_trie/parameters/sync_pages
I stopped counting after 1 minute; Total size was still rising :)

echo 2 > /sys/module/fib_trie/parameters/sync_pages
Total size in fib_triestat reaches its maximum in 33 sec

echo 3 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 31 sec

echo 4 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 23 sec

echo 8 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 17 sec

echo 16 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 14 sec

echo 32 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 14 sec

So I see that in the previous tests I got the time counting wrong,
so I modified the test script and ran the tests again:

echo 64 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 13 sec

echo 128 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 10 sec

echo 256 > /sys/module/fib_trie/parameters/sync_pages
Total size reaches its maximum in 10 sec

And for sync_pages > 256 the time for propagating routes is always 10 sec.
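
The earlier question of the "lowest acceptable value" can be answered
mechanically from these numbers. The sketch below is not part of any
patch; the names are hypothetical and the table simply copies the
timings listed above (sync_pages 1 is left out because that run was
stopped before it finished).

```c
#include <stddef.h>

struct sample {
	int sync_pages;
	int seconds;	/* time for Total size to reach its maximum */
};

/* Timings copied from the test runs listed above. */
const struct sample measured[] = {
	{ 2, 33 }, { 3, 31 }, { 4, 23 }, { 8, 17 }, { 16, 14 },
	{ 32, 14 }, { 64, 13 }, { 128, 10 }, { 256, 10 },
};

/* Returns the smallest sync_pages whose time is within slack_s seconds
 * of the best time in the set, or -1 for an empty set. */
int lowest_acceptable(const struct sample *s, size_t n, int slack_s)
{
	int best, pick = -1;
	size_t i;

	if (n == 0)
		return -1;
	best = s[0].seconds;
	for (i = 1; i < n; i++)
		if (s[i].seconds < best)
			best = s[i].seconds;
	for (i = 0; i < n; i++)
		if (s[i].seconds <= best + slack_s &&
		    (pick < 0 || s[i].sync_pages < pick))
			pick = s[i].sync_pages;
	return pick;
}
```

With no slack this picks 128; allowing 4 seconds of slack it picks 16.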

Also, today I got many messages in dmesg like this:
Fix inflate_threshold_root. Now=25 size=11 bits
:)

And after tuning:
/sys/module/fib_trie/parameters/inflate_threshold_root
no more such messages :)

Regards
Paweł Staszewski

>  net/ipv4/fib_trie.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 00a54b2..e8fca11 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -71,6 +71,7 @@
>  #include <linux/netlink.h>
>  #include <linux/init.h>
>  #include <linux/list.h>
> +#include <linux/moduleparam.h>
>  #include <net/net_namespace.h>
>  #include <net/ip.h>
>  #include <net/protocol.h>
> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>  /* tnodes to free after resize(); protected by RTNL */
>  static struct tnode *tnode_free_head;
> +static size_t tnode_free_size;
> +
> +static int sync_pages __read_mostly = 1000;
> +module_param(sync_pages, int, 0640);
>  
>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> @@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)
>  
>  static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
> -static const int halve_threshold_root = 15;
> -static const int inflate_threshold_root = 25;
>  
> +static int inflate_threshold_root __read_mostly = 25;
> +module_param(inflate_threshold_root, int, 0640);
> +
> +#define halve_threshold_root	(inflate_threshold_root / 2 + 1)
>  
>  static void __alias_free_mem(struct rcu_head *head)
>  {
> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
>  	BUG_ON(IS_LEAF(tn));
>  	tn->tnode_free = tnode_free_head;
>  	tnode_free_head = tn;
> +	tnode_free_size += sizeof(struct tnode) +
> +			   (sizeof(struct node *) << tn->bits);
>  }
>  
>  static void tnode_free_flush(void)
> @@ -404,6 +413,11 @@ static void tnode_free_flush(void)
>  		tn->tnode_free = NULL;
>  		tnode_free(tn);
>  	}
> +
> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> +		tnode_free_size = 0;
> +		synchronize_rcu();
> +	}
>  }
>  
>  static struct leaf *leaf_new(void)
>
>
>   


^ permalink raw reply	[flat|nested] 99+ messages in thread

* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-06  9:02                                                                 ` Jarek Poplawski
  2009-07-07 22:56                                                                   ` Paweł Staszewski
@ 2009-07-07 23:23                                                                   ` Paweł Staszewski
  2009-07-07 23:30                                                                     ` Paweł Staszewski
  1 sibling, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-07 23:23 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

[-- Attachment #1: Type: text/plain, Size: 4397 bytes --]

Jarek Poplawski wrote:
> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
> ...
>   
>> So i make tests with changing sync_pages
>> And
>>
>> ####################################
>> sync_pages: 64
>> total size reach maximum in 17sec
>>     
> ...
>   
>> ######################################
>> sync_pages: 128
>> Fib trie Total size reach max in 14sec
>>     
> ...
>   
>> #########################################
>> sync_pages: 256
>> hmm no difference also in 10sec
>>     
>
> 14 == 10!? ;-)
> ...
>   
:) I missed one test
>> And with sync_pages higher that 256 time of filling kernel routes is the  
>> same approx 10sec.
>>     
>
> Hmm... So, it's better than I expected; syncing after 128 or 256 pages
> could be quite reasonable. But then it would be interesting to find
> out if with such a safety we could go back to more aggressive values
> for possibly better performance. So here is 'the same' patch (so the
> previous, take 8, should be reverted), but with additional possibility
> to change:
> /sys/module/fib_trie/parameters/inflate_threshold_root
>
> I guess, you could try e.g. if: sync_pages 256, inflate_threshold_root 15
> can give faster lookups (or lower cpu loads); with this these inflate
> warnings could be back btw.; or maybe you'll find something in between
> like inflate_threshold_root 20 is optimal for you. (I think it should be
> enough to try this only for PREEMPT_NONE unless you have spare time ;-)
>
>   
And I can't run good CPU load tests because of the problem I described
in the "weird problem" emails.
The result depends on when I run mpstat and for how long, because for
15 sec I have 1 to 3% CPU load and then almost 40% CPU load for the
next 15 sec.
I tried to run mpstat -P ALL 1 60,
but after the 15 sec of 1 to 3% CPU load, the higher load that follows
is different every time; it varies from 30 to 50%.

So I made the test shorter and ran it while the CPU load was 1 to 3%:
"mpstat -P ALL 1 10"
The output is in the attached file.

Regards
Paweł Staszewski


> Thanks,
> Jarek P.
> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>   	all-in-one patch, or net-2.6)
>
>  net/ipv4/fib_trie.c |   18 ++++++++++++++++--
>  1 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> index 00a54b2..e8fca11 100644
> --- a/net/ipv4/fib_trie.c
> +++ b/net/ipv4/fib_trie.c
> @@ -71,6 +71,7 @@
>  #include <linux/netlink.h>
>  #include <linux/init.h>
>  #include <linux/list.h>
> +#include <linux/moduleparam.h>
>  #include <net/net_namespace.h>
>  #include <net/ip.h>
>  #include <net/protocol.h>
> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>  /* tnodes to free after resize(); protected by RTNL */
>  static struct tnode *tnode_free_head;
> +static size_t tnode_free_size;
> +
> +static int sync_pages __read_mostly = 1000;
> +module_param(sync_pages, int, 0640);
>  
>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> @@ -316,9 +321,11 @@ static inline void check_tnode(const struct tnode *tn)
>  
>  static const int halve_threshold = 25;
>  static const int inflate_threshold = 50;
> -static const int halve_threshold_root = 15;
> -static const int inflate_threshold_root = 25;
>  
> +static int inflate_threshold_root __read_mostly = 25;
> +module_param(inflate_threshold_root, int, 0640);
> +
> +#define halve_threshold_root	(inflate_threshold_root / 2 + 1)
>  
>  static void __alias_free_mem(struct rcu_head *head)
>  {
> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
>  	BUG_ON(IS_LEAF(tn));
>  	tn->tnode_free = tnode_free_head;
>  	tnode_free_head = tn;
> +	tnode_free_size += sizeof(struct tnode) +
> +			   (sizeof(struct node *) << tn->bits);
>  }
>  
>  static void tnode_free_flush(void)
> @@ -404,6 +413,11 @@ static void tnode_free_flush(void)
>  		tn->tnode_free = NULL;
>  		tnode_free(tn);
>  	}
> +
> +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> +		tnode_free_size = 0;
> +		synchronize_rcu();
> +	}
>  }
>  
>  static struct leaf *leaf_new(void)
>
>
>   


[-- Attachment #2: sync_pages.txt --]
[-- Type: text/plain, Size: 3200 bytes --]

sync_pages: 256
inflate_threshold_root: 10
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.60    0.00    0.00   99.40
Average:       0    0.00    0.00    0.00    0.00    0.00    0.60    0.00    0.00   99.40
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60

sync_pages: 256
inflate_threshold_root: 15
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.00    0.70    0.00    0.00   99.20
Average:       0    0.00    0.00    0.00    0.00    0.20    0.80    0.00    0.00   99.00
Average:       1    0.00    0.00    0.20    0.00    0.00    0.61    0.00    0.00   99.19

sync_pages: 256
inflate_threshold_root: 20
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.10    0.80    0.00    0.00   99.10
Average:       0    0.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00   99.00
Average:       1    0.00    0.00    0.00    0.00    0.00    0.61    0.00    0.00   99.39

sync_pages: 256
inflate_threshold_root: 25
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.00    0.70    0.00    0.00   99.30
Average:       0    0.00    0.00    0.00    0.00    0.20    1.00    0.00    0.00   98.80
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60


sync_pages: 512
inflate_threshold_root: 10
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.10    0.60    0.00    0.00   99.20
Average:       0    0.00    0.00    0.20    0.00    0.00    1.00    0.00    0.00   98.80
Average:       1    0.00    0.00    0.00    0.00    0.00    0.40    0.00    0.00   99.60

sync_pages: 512
inflate_threshold_root: 15
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.20    0.00    0.00    1.10    0.00    0.00   98.70
Average:       0    0.00    0.00    0.40    0.00    0.00    1.00    0.00    0.00   98.60
Average:       1    0.00    0.00    0.00    0.00    0.00    1.01    0.00    0.00   98.99

sync_pages: 512
inflate_threshold_root: 20
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.10    0.00    0.10    1.01    0.00    0.00   98.79
Average:       0    0.00    0.00    0.20    0.00    0.20    1.40    0.00    0.00   98.20
Average:       1    0.00    0.00    0.00    0.00    0.00    0.61    0.00    0.00   99.39

sync_pages: 512
inflate_threshold_root: 25
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
Average:     all    0.00    0.00    0.00    0.00    0.10    0.90    0.00    0.00   99.00
Average:       0    0.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00   99.00
Average:       1    0.00    0.00    0.00    0.00    0.20    0.80    0.00    0.00   99.00


* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-07 23:23                                                                   ` [PATCH net-2.6] " Paweł Staszewski
@ 2009-07-07 23:30                                                                     ` Paweł Staszewski
  0 siblings, 0 replies; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-07 23:30 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Mon, Jul 06, 2009 at 01:53:49AM +0200, Paweł Staszewski wrote:
>> ...
>>  
>>> So i make tests with changing sync_pages
>>> And
>>>
>>> ####################################
>>> sync_pages: 64
>>> total size reach maximum in 17sec
>>>     
>> ...
>>  
>>> ######################################
>>> sync_pages: 128
>>> Fib trie Total size reach max in 14sec
>>>     
>> ...
>>  
>>> #########################################
>>> sync_pages: 256
>>> hmm no difference also in 10sec
>>>     
>>
>> 14 == 10!? ;-)
>> ...
>>   
> :) I missed one test
>>> And with sync_pages higher than 256 the time to fill the kernel routes is
>>> the same, approx. 10 sec.
>>>     
>>
>> Hmm... So, it's better than I expected; syncing after 128 or 256 pages
>> could be quite reasonable. But then it would be interesting to find
>> out if with such a safety we could go back to more aggressive values
>> for possibly better performance. So here is 'the same' patch (so the
>> previous, take 8, should be reverted), but with additional possibility
>> to change:
>> /sys/module/fib_trie/parameters/inflate_threshold_root
>>
>> I guess, you could try e.g. if: sync_pages 256, 
>> inflate_threshold_root 15
>> can give faster lookups (or lower cpu loads); with this these inflate
>> warnings could be back btw.; or maybe you'll find something in between
>> like inflate_threshold_root 20 is optimal for you. (I think it should be
>> enough to try this only for PREEMPT_NONE unless you have spare time ;-)
>>
>>   
> And I can't make good tests of CPU load because of the problem I
> described in the "weird problem" emails.
> The result depends on when and for how long I run mpstat,
> because for 15 sec I have 1 to 3% CPU load and then almost
> 40% CPU load for the next 15 sec.
> I tried to run mpstat -P ALL 1 60,
> but after the 15 sec of 1 to 3% CPU load the following higher load is
> different every time; it varies from 30 to 50%.
>
> So I made the test shorter, while CPU load is 1 to 3%: "mpstat -P ALL 1 10";
> output is in the attached file.
>
> Regards
> Paweł Staszewski
>
I forgot to add:
Traffic during the tests was as below, within +/- 10 Mbit/s from test to test:
eth0:         RX: 231.21 Mb/s          TX: 287.40 Mb/s    
eth1:         RX: 289.19 Mb/s          TX: 231.35 Mb/s    

>
>> Thanks,
>> Jarek P.
>> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>>       all-in-one patch, or net-2.6)
>>
>>  net/ipv4/fib_trie.c |   18 ++++++++++++++++--
>>  1 files changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
>> index 00a54b2..e8fca11 100644
>> --- a/net/ipv4/fib_trie.c
>> +++ b/net/ipv4/fib_trie.c
>> @@ -71,6 +71,7 @@
>>  #include <linux/netlink.h>
>>  #include <linux/init.h>
>>  #include <linux/list.h>
>> +#include <linux/moduleparam.h>
>>  #include <net/net_namespace.h>
>>  #include <net/ip.h>
>>  #include <net/protocol.h>
>> @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, 
>> struct tnode *tn);
>>  static struct tnode *halve(struct trie *t, struct tnode *tn);
>>  /* tnodes to free after resize(); protected by RTNL */
>>  static struct tnode *tnode_free_head;
>> +static size_t tnode_free_size;
>> +
>> +static int sync_pages __read_mostly = 1000;
>> +module_param(sync_pages, int, 0640);
>>  
>>  static struct kmem_cache *fn_alias_kmem __read_mostly;
>>  static struct kmem_cache *trie_leaf_kmem __read_mostly;
>> @@ -316,9 +321,11 @@ static inline void check_tnode(const struct 
>> tnode *tn)
>>  
>>  static const int halve_threshold = 25;
>>  static const int inflate_threshold = 50;
>> -static const int halve_threshold_root = 15;
>> -static const int inflate_threshold_root = 25;
>>  
>> +static int inflate_threshold_root __read_mostly = 25;
>> +module_param(inflate_threshold_root, int, 0640);
>> +
>> +#define halve_threshold_root    (inflate_threshold_root / 2 + 1)
>>  
>>  static void __alias_free_mem(struct rcu_head *head)
>>  {
>> @@ -393,6 +400,8 @@ static void tnode_free_safe(struct tnode *tn)
>>      BUG_ON(IS_LEAF(tn));
>>      tn->tnode_free = tnode_free_head;
>>      tnode_free_head = tn;
>> +    tnode_free_size += sizeof(struct tnode) +
>> +               (sizeof(struct node *) << tn->bits);
>>  }
>>  
>>  static void tnode_free_flush(void)
>> @@ -404,6 +413,11 @@ static void tnode_free_flush(void)
>>          tn->tnode_free = NULL;
>>          tnode_free(tn);
>>      }
>> +
>> +    if (tnode_free_size >= PAGE_SIZE * sync_pages) {
>> +        tnode_free_size = 0;
>> +        synchronize_rcu();
>> +    }
>>  }
>>  
>>  static struct leaf *leaf_new(void)



* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-07 22:56                                                                   ` Paweł Staszewski
@ 2009-07-07 23:50                                                                     ` Jarek Poplawski
  2009-07-09 20:34                                                                       ` Paweł Staszewski
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-07 23:50 UTC (permalink / raw)
  To: Paweł Staszewski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
...
>> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>>   	all-in-one patch, or net-2.6)
>>
>>   
>
> Applied to 2.6.29.5 preempt/no-preempt and tested; with preempt I made
> only one test, with sync_pages = 256, to check that it is working :)
>
> So here are some tests for different sync_pages sizes.
...
> So I see that in the previous tests I got the time counting wrong.
> So I modified the test script and ran the tests again:
>
> echo 64 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 13 sec
>
> echo 128 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 10 sec
>
> echo 256 > /sys/module/fib_trie/parameters/sync_pages
> Total size reach max in 10 sec
>
> And for sync_pages > 256 the time for propagating routes is always 10 sec.

So this means sync_pages 128 or 256 is reasonable.

>
> Also today i have many messages in dmesg like this:
> Fix inflate_threshold_root. Now=25 size=11 bits
> :)

This is something new and a bit surprising to me: the same threshold
in previous tests didn't generate this? Do you mean more than: 
"Fix inflate_threshold_root. Now=15 size=11 bits" before?

> And after tune :
> /sys/module/fib_trie/parameters/inflate_threshold_root
> no more info :)

With what value?

Pawel, let's say that current defaults are:
inflate_threshold_root 25 sync_pages 256

I'd like you to try to check if e.g.:
inflate_threshold_root 15 sync_pages 256
can give you any visible or subjective difference worth tweaking it
at all? (These stats from the next messages don't show this enough.)
You don't need to hurry with this...

Thanks,
Jarek P.


* Re: [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 13:08                                                     ` [PATCH v2 " Jarek Poplawski
@ 2009-07-08  2:42                                                       ` David Miller
  2009-07-08  6:44                                                         ` Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: David Miller @ 2009-07-08  2:42 UTC (permalink / raw)
  To: jarkao2; +Cc: pstaszewski, netdev, robert, jorge

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Sun, 5 Jul 2009 15:08:28 +0200

> This new patch below is intended only for -stable (and later for
> net-next), because it doesn't meet rules of the current -rc. Anyway,
> it's not critical (but it actually fixes a regression from 2.6.22).

I think if we're going to toss this into -stable, we should
put it into net-2.6 too, and that's what I'm going to do.

Once this makes its way to Linus I'll work on the -stable
submissions.

And I'll make sure to add the tested-by tags, as you mentioned.

Thanks!


* Re: [PATCH v2 -stable] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-08  2:42                                                       ` David Miller
@ 2009-07-08  6:44                                                         ` Jarek Poplawski
  0 siblings, 0 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-08  6:44 UTC (permalink / raw)
  To: David Miller; +Cc: pstaszewski, netdev, robert, jorge

On Tue, Jul 07, 2009 at 07:42:08PM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Sun, 5 Jul 2009 15:08:28 +0200
> 
> > This new patch below is intended only for -stable (and later for
> > net-next), because it doesn't meet rules of the current -rc. Anyway,
> > it's not critical (but it actually fixes a regression from 2.6.22).
> 
> I think if we're going to toss this into -stable, we should
> put it into net-2.6 too, and that's what I'm going to do.

It's your decision: I don't think this patch is worth any arguing
about (de)stabilizing. Btw., since -stable rules are less strict it
seems natural such patches with bug fixes should rather go net-next
-> -stable way, unless I miss something?

Thanks,
Jarek P.


* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-07 23:50                                                                     ` Jarek Poplawski
@ 2009-07-09 20:34                                                                       ` Paweł Staszewski
  2009-07-14 19:41                                                                         ` [PATCH net-next] " Jarek Poplawski
  0 siblings, 1 reply; 99+ messages in thread
From: Paweł Staszewski @ 2009-07-09 20:34 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Paul E. McKenney, Linux Network Development list, Robert Olsson

Jarek Poplawski wrote:
> On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
>   
>> Jarek Poplawski wrote:
>>     
> ...
>   
>>> ---> (synchronize take 9; apply on top of the 2.6.29.x with the last
>>>   	all-in-one patch, or net-2.6)
>>>
>>>   
>>>       
>> Applied to 2.6.29.5 preempt/no-preempt and tested; with preempt I made
>> only one test, with sync_pages = 256, to check that it is working :)
>>
>> So here are some tests for different sync_pages sizes.
>>
> ...
>
>> So I see that in the previous tests I got the time counting wrong.
>> So I modified the test script and ran the tests again:
>>
>> echo 64 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 13 sec
>>
>> echo 128 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 10 sec
>>
>> echo 256 > /sys/module/fib_trie/parameters/sync_pages
>> Total size reach max in 10 sec
>>
>> And for sync_pages > 256 the time for propagating routes is always 10 sec.
>>     
>
> So this means sync_pages 128 or 256 is reasonable.
>
>   
>> Also today i have many messages in dmesg like this:
>> Fix inflate_threshold_root. Now=25 size=11 bits
>> :)
>>     
>
>   
> This is something new and a bit surprising to me: the same threshold
> in previous tests didn't generate this? Do you mean more than: 
> "Fix inflate_threshold_root. Now=15 size=11 bits" before?
>
>   
Yes. Sorry for that: these messages did not appear all day, only for the
5 minutes when I was running the tests.
They were reported only when all iBGP peers went down/up quickly.

>> And after tune :
>> /sys/module/fib_trie/parameters/inflate_threshold_root
>> no more info :)
>>     
>
> With what value?
>
>   
When I set inflate_threshold_root to 35 there were no messages, even when
all iBGP peers went down/up.

But I started to check when the "Fix inflate_threshold_root" message appears,
and I see that for me the best setting is 20: with it I get no messages in
normal router operation, that is, without BGP peers going down/up many
times in a short time.

> Pawel, let's say that current defaults are:
> inflate_threshold_root 25 sync_pages 256
>
> I'd like you to try to check if e.g.:
> inflate_threshold_root 15 sync_pages 256
> can give you any visible or subjective difference worth tweaking it
> at all? (These stats from the next messages don't show this enough.)
> You don't need to hurry with this...
>
>   
I will try to run more accurate tests at the weekend.

Regards
Paweł Staszewski
> Thanks,
> Jarek P.
>
>
>   



* Re: [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-06-30 19:48                         ` David Miller
  2009-06-30 20:14                           ` Jarek Poplawski
@ 2009-07-10 15:29                           ` Stephen Hemminger
  1 sibling, 0 replies; 99+ messages in thread
From: Stephen Hemminger @ 2009-07-10 15:29 UTC (permalink / raw)
  To: David Miller
  Cc: jarkao2, pstaszewski, robert, Robert.Olsson, jorge, dada1,
	robert.olsson, netdev

On Tue, 30 Jun 2009 12:48:49 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 29 Jun 2009 10:58:20 +0000
> 
> > On Mon, Jun 29, 2009 at 11:51:52AM +0200, Paweł Staszewski wrote:
> >> I apply this patch
> >>
> >> fib_triestats in attached file :)
> >>
> >>> ------------------->
> >>> ipv4: Fix fib_trie rebalancing, part 3
> >>>
> >>> Alas current delaying of freeing old tnodes by RCU in trie_rebalance
> >>> is still not enough because we can free a top tnode before updating a
> >>> t->trie pointer.
> >>>
> >>> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> >>> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
> >>> ---
> > 
> > David, I guess you could add:
> > 
> > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> 
> Done, and applied, thanks Jarek.

This is probably in kernel bugzilla as well, so someone should
update:
  http://bugzilla.kernel.org/show_bug.cgi?id=6648

-- 


* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-05 21:32                                                           ` Paul E. McKenney
  2009-07-05 22:23                                                             ` Jarek Poplawski
@ 2009-07-14 18:33                                                             ` Jarek Poplawski
  2009-07-20 14:41                                                               ` David Miller
  2009-07-14 21:20                                                             ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
  2 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-14 18:33 UTC (permalink / raw)
  To: David Miller
  Cc: Paul E. McKenney, Paweł Staszewski,
	Linux Network Development list, Robert Olsson

On Sun, Jul 05, 2009 at 02:32:32PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 05, 2009 at 07:32:08PM +0200, Jarek Poplawski wrote:
> > On Sun, Jul 05, 2009 at 06:20:03PM +0200, Jarek Poplawski wrote:
> > > On Sun, Jul 05, 2009 at 02:30:03AM +0200, Paweł Staszewski wrote:
> > > > Oh
> > > >
> > > > I forgot - please Jarek give me patch with sync rcu and i will make test  
> > > > on preempt kernel
> > > 
> > > Probably non-preempt kernel might need something like this more, but
> > > comparing is always interesting. This patch is based on Paul's
> > > suggestion (I hope).
> > 
> > Hold on ;-) Here is something even better... Syncing after 128 pages
> > might be still too slow, so here is a higher initial value, 1000, plus
> > you can change this while testing in:
> > 
> > /sys/module/fib_trie/parameters/sync_pages
> > 
> > It would be interesting to find the lowest acceptable value.
> 
> Looks like a promising approach to me!
> 
> 							Thanx, Paul

Below is a simpler version of this patch, without the sysfs parameter.
(I left the previous version quoted for comparison.) Thanks.

> > Jarek P.
> > ---> (synchronize take 8; apply on top of the 2.6.29.x with the last
> >  	all-in-one patch, or net-2.6)
> > 
> >  net/ipv4/fib_trie.c |   12 ++++++++++++
> >  1 files changed, 12 insertions(+), 0 deletions(-)
> > 
> > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
> > index 00a54b2..decc8d0 100644
> > --- a/net/ipv4/fib_trie.c
> > +++ b/net/ipv4/fib_trie.c
> > @@ -71,6 +71,7 @@
> >  #include <linux/netlink.h>
> >  #include <linux/init.h>
> >  #include <linux/list.h>
> > +#include <linux/moduleparam.h>
> >  #include <net/net_namespace.h>
> >  #include <net/ip.h>
> >  #include <net/protocol.h>
> > @@ -164,6 +165,10 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
> >  static struct tnode *halve(struct trie *t, struct tnode *tn);
> >  /* tnodes to free after resize(); protected by RTNL */
> >  static struct tnode *tnode_free_head;
> > +static size_t tnode_free_size;
> > +
> > +static int sync_pages __read_mostly = 1000;
> > +module_param(sync_pages, int, 0640);
> > 
> >  static struct kmem_cache *fn_alias_kmem __read_mostly;
> >  static struct kmem_cache *trie_leaf_kmem __read_mostly;
> > @@ -393,6 +398,8 @@ static void tnode_free_safe(struct tnode *tn)
> >  	BUG_ON(IS_LEAF(tn));
> >  	tn->tnode_free = tnode_free_head;
> >  	tnode_free_head = tn;
> > +	tnode_free_size += sizeof(struct tnode) +
> > +			   (sizeof(struct node *) << tn->bits);
> >  }
> > 
> >  static void tnode_free_flush(void)
> > @@ -404,6 +411,11 @@ static void tnode_free_flush(void)
> >  		tn->tnode_free = NULL;
> >  		tnode_free(tn);
> >  	}
> > +
> > +	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
> > +		tnode_free_size = 0;
> > +		synchronize_rcu();
> > +	}
> >  }
> > 
> >  static struct leaf *leaf_new(void)
> > --

------------------------>
ipv4: Use synchronize_rcu() during trie_rebalance()

During trie_rebalance() we free memory after resizing with call_rcu(),
but large updates, especially with PREEMPT_NONE configs, can cause
memory stresses, so this patch calls synchronize_rcu() in
tnode_free_flush() after each sync_pages to guarantee such freeing
(especially before resizing the root node).

The value of sync_pages = 128 is based on Pawel Staszewski's tests as
the lowest which doesn't hinder updating times. (For testing purposes
there was a sysfs module parameter to change it on demand, but it's
removed until we're sure it could be really useful.)

The patch is based on suggestions by: Paul E. McKenney
<paulmck@linux.vnet.ibm.com>

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/ipv4/fib_trie.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 63c2fa7..58ba9f4 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -164,6 +164,14 @@ static struct tnode *inflate(struct trie *t, struct tnode *tn);
 static struct tnode *halve(struct trie *t, struct tnode *tn);
 /* tnodes to free after resize(); protected by RTNL */
 static struct tnode *tnode_free_head;
+static size_t tnode_free_size;
+
+/*
+ * synchronize_rcu after call_rcu for that many pages; it should be especially
+ * useful before resizing the root node with PREEMPT_NONE configs; the value was
+ * obtained experimentally, aiming to avoid visible slowdown.
+ */
+static const int sync_pages = 128;
 
 static struct kmem_cache *fn_alias_kmem __read_mostly;
 static struct kmem_cache *trie_leaf_kmem __read_mostly;
@@ -393,6 +401,8 @@ static void tnode_free_safe(struct tnode *tn)
 	BUG_ON(IS_LEAF(tn));
 	tn->tnode_free = tnode_free_head;
 	tnode_free_head = tn;
+	tnode_free_size += sizeof(struct tnode) +
+			   (sizeof(struct node *) << tn->bits);
 }
 
 static void tnode_free_flush(void)
@@ -404,6 +414,11 @@ static void tnode_free_flush(void)
 		tn->tnode_free = NULL;
 		tnode_free(tn);
 	}
+
+	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
+		tnode_free_size = 0;
+		synchronize_rcu();
+	}
 }
 
 static struct leaf *leaf_new(void)
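
A side note, not part of the patch: the byte accounting above can be
tried out in userspace. This is only a minimal sketch, assuming 4 KiB
pages and the 56-byte tnode size from the fib_triestat output at the
start of the thread. struct free_acct and the acct_*() helpers are
hypothetical names, and synchronize_rcu() is replaced by a plain counter.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions of this userspace sketch, not kernel definitions. */
#define SKETCH_PAGE_SIZE 4096u	/* 4 KiB pages */
#define SKETCH_TNODE_HDR 56u	/* "size of tnode: 56 bytes" (fib_triestat) */

struct free_acct {
	size_t pending_bytes;	/* plays the role of tnode_free_size */
	size_t sync_pages;	/* threshold in pages, e.g. 128 */
	unsigned long syncs;	/* simulated synchronize_rcu() calls */
};

/* Bytes taken by a tnode with 2^bits child pointers. */
static size_t tnode_bytes(unsigned int bits)
{
	assert(bits < 8 * sizeof(size_t) - 4);
	return SKETCH_TNODE_HDR + (sizeof(void *) << bits);
}

/* Counterpart of the addition made in tnode_free_safe(). */
static void acct_free_safe(struct free_acct *a, unsigned int bits)
{
	a->pending_bytes += tnode_bytes(bits);
}

/*
 * Counterpart of the check appended to tnode_free_flush().
 * Returns 1 if a (simulated) synchronize_rcu() was issued.
 * Below the threshold the pending bytes are kept and keep
 * accumulating over later flushes.
 */
static int acct_flush(struct free_acct *a)
{
	if (a->pending_bytes >= SKETCH_PAGE_SIZE * a->sync_pages) {
		a->pending_bytes = 0;
		a->syncs++;
		return 1;
	}
	return 0;
}
```

With sync_pages = 128 and 4 KiB pages the threshold is 524288 bytes.
Assuming 8-byte pointers, one 11-bit tnode accounts for
56 + 8 * 2048 = 16440 bytes, so about 32 such tnodes have to be queued
before acct_flush() reports a sync.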


* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-09 20:34                                                                       ` Paweł Staszewski
@ 2009-07-14 19:41                                                                         ` Jarek Poplawski
  2009-07-15  7:43                                                                           ` Robert Olsson
  2009-07-20 14:41                                                                           ` David Miller
  0 siblings, 2 replies; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-14 19:41 UTC (permalink / raw)
  To: David Miller
  Cc: Paweł Staszewski, Linux Network Development list,
	Robert Olsson, Jorge Boncompte [DTI2]

On Thu, Jul 09, 2009 at 10:34:17PM +0200, Paweł Staszewski wrote:
> Jarek Poplawski wrote:
>> On Wed, Jul 08, 2009 at 12:56:13AM +0200, Paweł Staszewski wrote:
...
>>> Also today i have many messages in dmesg like this:
>>> Fix inflate_threshold_root. Now=25 size=11 bits
>>> :)
>>>     
>>
>>   This is something new and a bit surprising to me: the same threshold
>> in previous tests didn't generate this? Do you mean more than: "Fix 
>> inflate_threshold_root. Now=15 size=11 bits" before?
>>
>>   
> Yes. Sorry for that: these messages did not appear all day, only for the
> 5 minutes when I was running the tests.
> They were reported only when all iBGP peers went down/up quickly.
>
>>> And after tune :
>>> /sys/module/fib_trie/parameters/inflate_threshold_root
>>> no more info :)
>>>     
>>
>> With what value?
>>
>>   
> When I set inflate_threshold_root to 35 there were no messages, even when
> all iBGP peers went down/up.

So it looks like the patch tested earlier could be still useful; after
changing the inflate_threshold_root it seems these warnings should be
very rare but there is no reason to alarm users with something they
can't fix optimally, anyway.

Thanks,
Jarek P.
--------------------->
ipv4: Fix inflate_threshold_root automatically

During large updates there could be triggered warnings like: "Fix
inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
node isn't finished in 10 loops. It should be much rarer now, after
changing the threshold from 15 to 25, and a temporary problem, so
this patch tries to handle it automatically using a fix variable to
increase by one inflate threshold for next root resizes (up to the 35
limit, max fix = 10). The fix variable is decreased when root's
inflate() finishes below 7 loops (even if some other, smaller table/
trie is updated -- for simplicity the fix variable is global for now).

Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-13 13:32:53.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-13 15:16:18.000000000 +0200
@@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
 static const int halve_threshold_root = 15;
 static const int inflate_threshold_root = 25;
 
+static int inflate_threshold_root_fix;
+#define INFLATE_FIX_MAX 10	/* a comment in resize() */
 
 static void __alias_free_mem(struct rcu_head *head)
 {
@@ -617,7 +619,8 @@ static struct node *resize(struct trie *
 	/* Keep root node larger  */
 
 	if (!tn->parent)
-		inflate_threshold_use = inflate_threshold_root;
+		inflate_threshold_use = inflate_threshold_root +
+					inflate_threshold_root_fix;
 	else
 		inflate_threshold_use = inflate_threshold;
 
@@ -641,15 +644,27 @@ static struct node *resize(struct trie *
 	}
 
 	if (max_resize < 0) {
-		if (!tn->parent)
-			pr_warning("Fix inflate_threshold_root."
-				   " Now=%d size=%d bits\n",
-				   inflate_threshold_root, tn->bits);
-		else
+		if (!tn->parent) {
+			/*
+			 * It was observed that during large updates even
+			 * inflate_threshold_root = 35 might be needed to avoid
+			 * this warning; but it should be temporary, so let's
+			 * try to handle this automatically.
+			 */
+			if (inflate_threshold_root_fix < INFLATE_FIX_MAX)
+				inflate_threshold_root_fix++;
+			else
+				pr_warning("Fix inflate_threshold_root."
+					   " Now=%d size=%d bits fix=%d\n",
+					   inflate_threshold_root, tn->bits,
+					   inflate_threshold_root_fix);
+		} else {
 			pr_warning("Fix inflate_threshold."
 				   " Now=%d size=%d bits\n",
 				   inflate_threshold, tn->bits);
-	}
+		}
+	} else if (max_resize > 3 && !tn->parent && inflate_threshold_root_fix)
+		inflate_threshold_root_fix--;
 
 	check_tnode(tn);
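
A side note, not part of the patch: the adjustment rule can be isolated
from resize() and exercised in userspace. A minimal sketch, assuming the
10-loop budget from the changelog and that a negative max_resize means
the budget ran out. struct root_fix, root_threshold() and
root_fix_update() are hypothetical names; the real patch keeps one
global variable and prints with pr_warning() where this sketch only
counts.

```c
#include <assert.h>

#define INFLATE_FIX_MAX 10

static const int inflate_threshold_root = 25;

struct root_fix {
	int fix;		/* inflate_threshold_root_fix in the patch */
	unsigned long warnings;	/* how often the warning would be printed */
};

/* Threshold used for the root node, as in the changed resize(). */
static int root_threshold(const struct root_fix *rf)
{
	return inflate_threshold_root + rf->fix;
}

/*
 * max_resize is what remains of the 10-loop budget once the inflate
 * loop has ended; -1 means the budget was exhausted.
 */
static void root_fix_update(struct root_fix *rf, int max_resize)
{
	assert(max_resize >= -1 && max_resize <= 10);

	if (max_resize < 0) {
		if (rf->fix < INFLATE_FIX_MAX)
			rf->fix++;	/* quietly raise the threshold */
		else
			rf->warnings++;	/* limit reached: warn */
	} else if (max_resize > 3 && rf->fix) {
		rf->fix--;		/* finished in fewer than 7 loops */
	}
}
```

Starting from fix = 0, ten exhausted root resizes raise the effective
threshold from 25 to 35; any further ones only add to the warning count,
and each root resize that ends with more than 3 loops to spare lowers
the fix by one again.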
 


* [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups
  2009-07-05 21:32                                                           ` Paul E. McKenney
  2009-07-05 22:23                                                             ` Jarek Poplawski
  2009-07-14 18:33                                                             ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-14 21:20                                                             ` Jarek Poplawski
  2009-07-20 14:41                                                               ` David Miller
  2 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-14 21:20 UTC (permalink / raw)
  To: David Miller
  Cc: Paul E. McKenney, Paweł Staszewski,
	Linux Network Development list, Robert Olsson


While looking for other fib_trie problems reported by Pawel Staszewski
I noticed there are a few uses of tnode_get_child() and node_parent()
in lookups instead of their rcu versions.

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---
(this patch was prepared on top of my 2 today's fib_trie patches)

diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
--- a/net/ipv4/fib_trie.c	2009-07-14 20:40:39.000000000 +0200
+++ b/net/ipv4/fib_trie.c	2009-07-14 22:41:26.000000000 +0200
@@ -1465,7 +1465,7 @@ static int fn_trie_lookup(struct fib_tab
 			cindex = tkey_extract_bits(mask_pfx(key, current_prefix_length),
 						   pos, bits);
 
-		n = tnode_get_child(pn, cindex);
+		n = tnode_get_child_rcu(pn, cindex);
 
 		if (n == NULL) {
 #ifdef CONFIG_IP_FIB_TRIE_STATS
@@ -1600,7 +1600,7 @@ backtrace:
 		if (chopped_off <= pn->bits) {
 			cindex &= ~(1 << (chopped_off-1));
 		} else {
-			struct tnode *parent = node_parent((struct node *) pn);
+			struct tnode *parent = node_parent_rcu((struct node *) pn);
 			if (!parent)
 				goto failed;
 
@@ -1813,7 +1813,7 @@ static struct leaf *trie_firstleaf(struc
 static struct leaf *trie_nextleaf(struct leaf *l)
 {
 	struct node *c = (struct node *) l;
-	struct tnode *p = node_parent(c);
+	struct tnode *p = node_parent_rcu(c);
 
 	if (!p)
 		return NULL;	/* trie with just one leaf */
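
A side note, not part of the patch: the point of the _rcu accessors is
that a lookup running without RTNL must load each child or parent
pointer exactly once, with ordering against the updater that published
it. A rough userspace analogue, assuming C11 atomics as a stand-in for
rcu_dereference() and rcu_assign_pointer() (acquire/release is stronger
than what the kernel needs on most architectures). The sk_* names are
hypothetical and the child array is fixed at 4 slots for brevity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sk_node {
	int key;
};

struct sk_tnode {
	unsigned int bits;			/* 2^bits slots in use */
	_Atomic(struct sk_node *) child[4];	/* fixed size for the sketch */
};

/* Reader side, in the spirit of tnode_get_child_rcu(). */
static struct sk_node *sk_get_child_rcu(struct sk_tnode *tn, unsigned int i)
{
	assert(i < (1u << tn->bits));
	return atomic_load_explicit(&tn->child[i], memory_order_acquire);
}

/* Updater side, in the spirit of rcu_assign_pointer(). */
static void sk_put_child(struct sk_tnode *tn, unsigned int i,
			 struct sk_node *n)
{
	assert(i < (1u << tn->bits));
	atomic_store_explicit(&tn->child[i], n, memory_order_release);
}
```

A reader that goes through sk_get_child_rcu() sees either NULL or a
fully initialized node, never a half-published one; a plain load gives
no such guarantee.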


* [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-14 19:41                                                                         ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-15  7:43                                                                           ` Robert Olsson
  2009-07-15 13:05                                                                             ` Jarek Poplawski
  2009-07-20 14:41                                                                           ` David Miller
  1 sibling, 1 reply; 99+ messages in thread
From: Robert Olsson @ 2009-07-15  7:43 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Paweł Staszewski,
	Linux Network Development list, Robert Olsson,
	Jorge Boncompte [DTI2]


Jarek Poplawski writes:


Looks good. Maybe we're getting close to a generic solution: take a very
optimistic approach wrt thresholds for the root node and adjust the
settings without the warning. Or maybe now even remove the warning totally,
with a stats counter instead?

Can we even consider some other, different strategy for bumping up the root
node?

We need all the lookup performance we can get now that we try to route
without the route cache. And we probably need to evaluate the cost of the
multiple lookups again, at least for LOCAL and MAIN, when we're talking
about routing, or at least straightforward simple routing. (Semantic change)

I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
we see in/close to the Internet core. This is w/o route cache on our hi-end
Opterons with 8 CPU cores using niu and ixgbe. I'll test again, with your
patches, when I'm back from vacation.

Cheers
					--ro

 > So it looks like the patch tested earlier could be still useful; after
 > changing the inflate_threshold_root it seems these warnings should be
 > very rare but there is no reason to alarm users with something they
 > can't fix optimally, anyway.
 > 
 > Thanks,
 > Jarek P.
 > --------------------->
 > ipv4: Fix inflate_threshold_root automatically
 > 
 > During large updates there could be triggered warnings like: "Fix
 > inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
 > node isn't finished in 10 loops. It should be much rarer now, after
 > changing the threshold from 15 to 25, and a temporary problem, so
 > this patch tries to handle it automatically using a fix variable to
 > increase by one inflate threshold for next root resizes (up to the 35
 > limit, max fix = 10). The fix variable is decreased when root's
 > inflate() finishes below 7 loops (even if some other, smaller table/
 > trie is updated -- for simplicity the fix variable is global for now).
 > 
 > Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
 > Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
 > Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
 > Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
 > ---
 > 
 > diff -Nurp a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
 > --- a/net/ipv4/fib_trie.c	2009-07-13 13:32:53.000000000 +0200
 > +++ b/net/ipv4/fib_trie.c	2009-07-13 15:16:18.000000000 +0200
 > @@ -327,6 +327,8 @@ static const int inflate_threshold = 50;
 >  static const int halve_threshold_root = 15;
 >  static const int inflate_threshold_root = 25;
 >  
 > +static int inflate_threshold_root_fix;
 > +#define INFLATE_FIX_MAX 10	/* a comment in resize() */
 >  
 >  static void __alias_free_mem(struct rcu_head *head)
 >  {
 > @@ -617,7 +619,8 @@ static struct node *resize(struct trie *
 >  	/* Keep root node larger  */
 >  
 >  	if (!tn->parent)
 > -		inflate_threshold_use = inflate_threshold_root;
 > +		inflate_threshold_use = inflate_threshold_root +
 > +					inflate_threshold_root_fix;
 >  	else
 >  		inflate_threshold_use = inflate_threshold;
 >  
 > @@ -641,15 +644,27 @@ static struct node *resize(struct trie *
 >  	}
 >  
 >  	if (max_resize < 0) {
 > -		if (!tn->parent)
 > -			pr_warning("Fix inflate_threshold_root."
 > -				   " Now=%d size=%d bits\n",
 > -				   inflate_threshold_root, tn->bits);
 > -		else
 > +		if (!tn->parent) {
 > +			/*
 > +			 * It was observed that during large updates even
 > +			 * inflate_threshold_root = 35 might be needed to avoid
 > +			 * this warning; but it should be temporary, so let's
 > +			 * try to handle this automatically.
 > +			 */
 > +			if (inflate_threshold_root_fix < INFLATE_FIX_MAX)
 > +				inflate_threshold_root_fix++;
 > +			else
 > +				pr_warning("Fix inflate_threshold_root."
 > +					   " Now=%d size=%d bits fix=%d\n",
 > +					   inflate_threshold_root, tn->bits,
 > +					   inflate_threshold_root_fix);
 > +		} else {
 >  			pr_warning("Fix inflate_threshold."
 >  				   " Now=%d size=%d bits\n",
 >  				   inflate_threshold, tn->bits);
 > -	}
 > +		}
 > +	} else if (max_resize > 3 && !tn->parent && inflate_threshold_root_fix)
 > +		inflate_threshold_root_fix--;
 >  
 >  	check_tnode(tn);
 >  
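
The bookkeeping described in the commit message above can be modelled in
userspace roughly as follows. This is only a sketch under stated assumptions:
the constants 25 and 10 and the "max_resize > 3" test come from the diff, but
the helper names root_inflate_threshold() and root_resize_done() are
hypothetical and do not exist in fib_trie.c, and the real resize() does far
more than this. max_resize stands for the loop budget left after the inflate
loop; the diff treats a negative value as "budget exhausted".

```c
#include <assert.h>

/* Values taken from the quoted diff. */
#define INFLATE_THRESHOLD_ROOT 25
#define INFLATE_FIX_MAX 10

/* Global, as in the patch: shared by all tries for simplicity. */
static int inflate_threshold_root_fix;

/* Hypothetical helper: the threshold a root resize would use. */
static int root_inflate_threshold(void)
{
	return INFLATE_THRESHOLD_ROOT + inflate_threshold_root_fix;
}

/*
 * Hypothetical helper: models the end of a root-node resize.
 * Returns 1 if the warning would be printed, 0 otherwise.
 */
static int root_resize_done(int max_resize)
{
	int warn = 0;

	if (max_resize < 0) {
		/* Budget exhausted: raise the threshold, warn only at the cap. */
		if (inflate_threshold_root_fix < INFLATE_FIX_MAX)
			inflate_threshold_root_fix++;
		else
			warn = 1;
	} else if (max_resize > 3 && inflate_threshold_root_fix) {
		/* Finished with budget to spare: relax the fix by one. */
		inflate_threshold_root_fix--;
	}

	/* The fix never leaves the range 0..INFLATE_FIX_MAX. */
	assert(inflate_threshold_root_fix >= 0 &&
	       inflate_threshold_root_fix <= INFLATE_FIX_MAX);
	return warn;
}
```

In this model, ten consecutive exhausted resizes move the effective threshold
from 25 to 35 without any warning, and only the eleventh would warn. A later
resize that finishes with more than 3 loops to spare steps the threshold back
down by one.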

* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-15  7:43                                                                           ` Robert Olsson
@ 2009-07-15 13:05                                                                             ` Jarek Poplawski
  2009-07-17  8:08                                                                               ` Robert Olsson
  0 siblings, 1 reply; 99+ messages in thread
From: Jarek Poplawski @ 2009-07-15 13:05 UTC (permalink / raw)
  To: Robert Olsson
  Cc: David Miller, Paweł Staszewski,
	Linux Network Development list, Robert Olsson,
	Jorge Boncompte [DTI2]

On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote:
> 
> Jarek Poplawski writes:
> 
> 
> Looks good. Maybe we're getting close to a generic solution: take
> a very optimistic approach wrt. thresholds for the root node and adjust the
> settings without the warning. Or maybe even remove the warning entirely now
> and replace it with a stats counter?

I guess, we could, but maybe let's wait a bit to make sure there is
nothing surprising?

> 
> Can we even consider a different strategy for bumping up the root
> node?
> 
> We need all the lookup performance we can get now that we try to route
> without the route cache. And we probably need to re-evaluate the cost of the
> multiple lookups, at least for LOCAL and MAIN, when we're talking routing, or
> at least straightforward simple routing. (Semantic change)
> 
> I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
> we see in/close to the Internet core. This is w/o the route cache, on our
> high-end Opterons with 8 CPU cores, using niu and ixgbe. I'll test again, and
> test your patches, when I'm back from vacation.
> 

Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
tests show there is a better strategy we should go for it.

Thanks,
Jarek P.

* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-15 13:05                                                                             ` Jarek Poplawski
@ 2009-07-17  8:08                                                                               ` Robert Olsson
  0 siblings, 0 replies; 99+ messages in thread
From: Robert Olsson @ 2009-07-17  8:08 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Robert Olsson, David Miller, Paweł Staszewski,
	Linux Network Development list, Jorge Boncompte [DTI2]


Jarek Poplawski writes:
 > On Wed, Jul 15, 2009 at 09:43:11AM +0200, Robert Olsson wrote:

 > > a very optimistic approach wrt. thresholds for the root node and adjust the
 > > settings without the warning. Or maybe even remove the warning entirely now
 > > and replace it with a stats counter?
 > 
 > I guess, we could, but maybe let's wait a bit to make sure there is
 > nothing surprising?

Yes, if Pawel is running it we'll get reports. I have no chance to upgrade
any of our routers now. I've seen this printout on one of our routers, but we
don't do "clear ip bgp *" too often, and besides we try to use
soft-reconfiguration inbound.


 > > I think I've got ~6.2 Gbit/s for simplex forwarding using traffic patterns
 > > we see in/close to the Internet core. This is w/o the route cache, on our
 > > high-end Opterons with 8 CPU cores, using niu and ixgbe. I'll test again, and
 > > test your patches, when I'm back from vacation.
 > > 

 > Sure, I was mainly aiming at safe defaults (wrt. memory usage), but if
 > tests show there is a better strategy we should go for it.

Routing without the route cache is a "new" area, probably for a minority of
systems where caching is not possible. Read: BGP routers in the core.

Yes, we should have safe defaults. Thanks for all your work.

Signed-off-by: Robert Olsson <robert.olsson@its.uu.se>

Cheers 
					--ro

* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-14 18:33                                                             ` [PATCH net-next] " Jarek Poplawski
@ 2009-07-20 14:41                                                               ` David Miller
  0 siblings, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
  To: jarkao2; +Cc: paulmck, pstaszewski, netdev, robert

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 20:33:08 +0200

> ipv4: Use synchronize_rcu() during trie_rebalance()
> 
> During trie_rebalance() we free memory after resizing with call_rcu(),
> but large updates, especially with PREEMPT_NONE configs, can cause
> memory stresses, so this patch calls synchronize_rcu() in
> tnode_free_flush() after each sync_pages to guarantee such freeing
> (especially before resizing the root node).
> 
> The value of sync_pages = 128 is based on Pawel Staszewski's tests as
> the lowest which doesn't hinder updating times. (For testing purposes
> there was a sysfs module parameter to change it on demand, but it's
> removed until we're sure it could be really useful.)
> 
> The patch is based on suggestions by: Paul E. McKenney
> <paulmck@linux.vnet.ibm.com>
> 
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.
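
The flush policy in the quoted commit message can be illustrated with a small
userspace model. This is a sketch under loud assumptions, not the kernel code:
sync_pages = 128 is from the commit message, but the 4096-byte page size, the
byte counter pending_free_bytes, and the helpers note_deferred_free(),
flush_deferred_frees() and wait_for_grace_period() are all hypothetical.
wait_for_grace_period() only counts calls; it stands in for the point where
the patch calls synchronize_rcu().

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u          /* assumed page size */

static const unsigned int sync_pages = 128; /* value from the commit message */

static size_t pending_free_bytes;       /* bytes queued for deferred freeing */
static unsigned int grace_period_waits; /* how often the stub below ran */

/* Stub standing in for synchronize_rcu(); it only counts invocations. */
static void wait_for_grace_period(void)
{
	grace_period_waits++;
}

/* Record that a node of `size` bytes was handed to deferred freeing. */
static void note_deferred_free(size_t size)
{
	assert(size > 0);
	pending_free_bytes += size;
}

/*
 * Called after a rebalance step: once roughly sync_pages pages' worth of
 * memory is pending, wait for a grace period so it can really be released.
 */
static void flush_deferred_frees(void)
{
	if (pending_free_bytes >= (size_t)SKETCH_PAGE_SIZE * sync_pages) {
		pending_free_bytes = 0;
		wait_for_grace_period();
	}
}
```

The design point the model tries to capture is the trade-off named in the
commit message: waiting for a grace period on every free would slow updates
down, while never waiting lets deferred memory pile up during large updates,
so the wait is taken only once per sync_pages pages of pending memory.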

* Re: [PATCH net-next] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits
  2009-07-14 19:41                                                                         ` [PATCH net-next] " Jarek Poplawski
  2009-07-15  7:43                                                                           ` Robert Olsson
@ 2009-07-20 14:41                                                                           ` David Miller
  1 sibling, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
  To: jarkao2; +Cc: pstaszewski, netdev, robert, jorge

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 21:41:00 +0200

> ipv4: Fix inflate_threshold_root automatically
> 
> During large updates there could be triggered warnings like: "Fix
> inflate_threshold_root. Now=25 size=11 bits" if inflate() of the root
> node isn't finished in 10 loops. It should be much rarer now, after
> changing the threshold from 15 to 25, and a temporary problem, so
> this patch tries to handle it automatically using a fix variable to
> increase by one inflate threshold for next root resizes (up to the 35
> limit, max fix = 10). The fix variable is decreased when root's
> inflate() finishes below 7 loops (even if some other, smaller table/
> trie is updated -- for simplicity the fix variable is global for now).
> 
> Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Reported-by: Jorge Boncompte [DTI2] <jorge@dti2.net>
> Tested-by: Pawel Staszewski <pstaszewski@itcare.pl>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.

* Re: [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups
  2009-07-14 21:20                                                             ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
@ 2009-07-20 14:41                                                               ` David Miller
  0 siblings, 0 replies; 99+ messages in thread
From: David Miller @ 2009-07-20 14:41 UTC (permalink / raw)
  To: jarkao2; +Cc: paulmck, pstaszewski, netdev, robert

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 14 Jul 2009 23:20:32 +0200

> 
> While looking for other fib_trie problems reported by Pawel Staszewski
> I noticed there are a few uses of tnode_get_child() and node_parent()
> in lookups instead of their rcu versions.
> 
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Applied.

Thread overview: 99+ messages
2009-06-25 15:48 rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
2009-06-25 21:19 ` Eric Dumazet
2009-06-25 21:52   ` Paweł Staszewski
2009-06-25 22:54     ` Eric Dumazet
2009-06-26 10:06       ` Paweł Staszewski
2009-06-26 10:34         ` Eric Dumazet
2009-06-26 10:47           ` Paweł Staszewski
2009-06-26 10:52             ` Eric Dumazet
2009-06-26 17:26               ` Paweł Staszewski
2009-06-26  8:03   ` Jarek Poplawski
2009-06-26  9:19     ` Robert Olsson
2009-06-26  9:37       ` Jarek Poplawski
2009-06-26 10:26         ` Jorge Boncompte [DTI2]
2009-06-26 12:42         ` Robert Olsson
2009-06-26 12:54           ` Jarek Poplawski
2009-06-26 13:28             ` Jarek Poplawski
2009-06-26 13:52               ` Robert Olsson
2009-06-26 15:10                 ` Jarek Poplawski
2009-06-26 15:30                   ` Paul E. McKenney
2009-06-26 15:54                     ` Jarek Poplawski
2009-06-26 16:15                       ` Jarek Poplawski
2009-06-26 16:23                         ` Paul E. McKenney
2009-06-26 16:45                           ` Jarek Poplawski
2009-06-26 17:05                             ` Paul E. McKenney
2009-06-26 18:05                               ` Jarek Poplawski
2009-06-26 18:21                                 ` Paul E. McKenney
2009-06-26 20:19                                   ` Jarek Poplawski
2009-06-26 20:26                                 ` Robert Olsson
2009-06-26 20:37                                   ` Jarek Poplawski
2009-06-26 21:20                                     ` Jarek Poplawski
2009-06-27 19:20       ` Jarek Poplawski
2009-06-27 20:51         ` Jarek Poplawski
2009-06-28  0:28           ` Paweł Staszewski
2009-06-28 11:11           ` Robert Olsson
2009-06-29  7:57             ` Paweł Staszewski
2009-06-28 11:04         ` Robert Olsson
2009-06-28 12:03           ` Jarek Poplawski
2009-06-28 14:35           ` Jarek Poplawski
2009-06-28 15:32             ` Paweł Staszewski
2009-06-28 15:48               ` Paweł Staszewski
2009-06-28 19:56                 ` Jarek Poplawski
2009-06-28 21:36                 ` Jarek Poplawski
2009-06-29  8:08                   ` Paweł Staszewski
2009-06-29  8:47                     ` Paweł Staszewski
2009-06-29  9:27                       ` Jarek Poplawski
2009-06-29  9:43                         ` Paweł Staszewski
2009-06-29  8:33                   ` [PATCH net-2.6] " Jarek Poplawski
2009-06-29  9:51                     ` Paweł Staszewski
2009-06-29 10:47                       ` Jarek Poplawski
2009-06-29 16:24                         ` Paweł Staszewski
2009-06-29 17:09                           ` Jarek Poplawski
2009-06-30  7:09                         ` Jarek Poplawski
2009-06-30 20:16                           ` Paweł Staszewski
2009-06-30 20:41                             ` Jarek Poplawski
2009-06-30 23:31                               ` Paweł Staszewski
2009-07-01  6:36                                 ` Jarek Poplawski
     [not found]                                   ` <20090701072409.GA12592@ff.dom.local>
2009-07-01  9:43                                     ` Paweł Staszewski
2009-07-01  9:50                                       ` Paweł Staszewski
2009-07-01 10:13                                       ` Jarek Poplawski
2009-07-01 11:04                                         ` Jarek Poplawski
2009-07-01 22:17                                           ` Paweł Staszewski
2009-07-02  5:32                                             ` Jarek Poplawski
2009-07-02  5:43                                               ` Paweł Staszewski
2009-07-02  6:00                                                 ` Jarek Poplawski
2009-07-02 15:31                                                   ` Robert Olsson
2009-07-02 19:06                                                     ` Jarek Poplawski
2009-07-02 21:32                                                       ` Robert Olsson
2009-07-02 22:13                                                         ` Jarek Poplawski
2009-07-05  0:26                                                   ` Paweł Staszewski
2009-07-05  0:30                                                     ` Paweł Staszewski
2009-07-05 16:20                                                       ` Jarek Poplawski
2009-07-05 17:32                                                         ` Jarek Poplawski
2009-07-05 21:32                                                           ` Paul E. McKenney
2009-07-05 22:23                                                             ` Jarek Poplawski
2009-07-05 23:53                                                               ` Paweł Staszewski
2009-07-06  9:02                                                                 ` Jarek Poplawski
2009-07-07 22:56                                                                   ` Paweł Staszewski
2009-07-07 23:50                                                                     ` Jarek Poplawski
2009-07-09 20:34                                                                       ` Paweł Staszewski
2009-07-14 19:41                                                                         ` [PATCH net-next] " Jarek Poplawski
2009-07-15  7:43                                                                           ` Robert Olsson
2009-07-15 13:05                                                                             ` Jarek Poplawski
2009-07-17  8:08                                                                               ` Robert Olsson
2009-07-20 14:41                                                                           ` David Miller
2009-07-07 23:23                                                                   ` [PATCH net-2.6] " Paweł Staszewski
2009-07-07 23:30                                                                     ` Paweł Staszewski
2009-07-14 18:33                                                             ` [PATCH net-next] " Jarek Poplawski
2009-07-20 14:41                                                               ` David Miller
2009-07-14 21:20                                                             ` [PATCH net-next] ipv4: fib_trie: Use tnode_get_child_rcu() and node_parent_rcu() in lookups Jarek Poplawski
2009-07-20 14:41                                                               ` David Miller
2009-07-05  0:31                                                     ` [PATCH net-2.6] Re: rib_trie / Fix inflate_threshold_root. Now=15 size=11 bits Paweł Staszewski
2009-07-05 12:56                                                     ` [PATCH -stable] " Jarek Poplawski
2009-07-05 13:08                                                     ` [PATCH v2 " Jarek Poplawski
2009-07-08  2:42                                                       ` David Miller
2009-07-08  6:44                                                         ` Jarek Poplawski
2009-06-29 10:58                       ` [PATCH net-2.6] " Jarek Poplawski
2009-06-30 19:48                         ` David Miller
2009-06-30 20:14                           ` Jarek Poplawski
2009-07-10 15:29                           ` Stephen Hemminger
