* this_cpu_xx's patchset effect on SLUB cycle counts
@ 2009-10-13 20:20 Christoph Lameter
  2009-10-14  3:12 ` David Rientjes
  0 siblings, 1 reply; 3+ messages in thread
From: Christoph Lameter @ 2009-10-13 20:20 UTC
  To: Mel Gorman
  Cc: linux-kernel, Pekka Enberg, Tejun Heo, David Rientjes, Mathieu Desnoyers


The recent this_cpu_xx patchsets have made the allocation fastpath in SLUB
more effective by avoiding lookups and the disabling of interrupts. The
approaches can likely also be applied to other allocators.

Measurements were done using the in-kernel page allocator benchmarks that
were also posted today. I hope that these numbers can lead to an
evaluation of how useful the this_cpu_xx operations are and how to most
effectively apply them in the kernel.

The following kernels were run:

A. Upstream with Tejun's for-next tree (this includes the this_cpu_xx base
functionality but not the enhancements to the page allocator or the rework
of SLUB's fastpath).

B. Kernel A with the page allocator and slub enhancements (including the
one titled "aggressive use of this_cpu_xx").

C. Kernel B with the slub irqless patch on top.

Note that B and C improve only the fastpath of the SLUB allocator. They do
not affect the slowpath or the page allocator fallback. Well, that is not
entirely true: C in particular adds code to the slowpath. The question is
whether that offsets the gains in the fastpath.


The following tests were run:

1. Single threaded testing

A single thread performs allocations and frees. The first test
does a large number of allocs and then a large number of frees. The second
test performs a single alloc followed by a single free, a large number of
times. The same object is reused in the second test, which allows use of
the fastpath for both alloc and free. The first test requires periodic
fallback to the slowpath on alloc and almost constant fallback to the
slowpath on free.
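
The difference between the two single threaded tests can be illustrated
with a toy model. This is only a sketch, not kernel code: it assumes a
fixed number of objects per slab (32, chosen arbitrarily), a fresh slab
on every alloc slowpath, and a free fastpath that applies only to objects
from the slab the cpu is currently allocating from. All names in it are
made up for illustration.

```python
from collections import Counter

OBJECTS_PER_SLAB = 32  # assumed; the real value depends on object size and order


class ToyCpuSlab:
    """Counts fastpath and slowpath hits for one cpu (toy model only)."""

    def __init__(self):
        self.next_slab_id = 0
        self.current_slab = None  # id of the slab this cpu allocates from
        self.freelist = []        # free objects of the current slab
        self.stats = Counter()

    def alloc(self):
        if self.freelist:
            self.stats["alloc_fast"] += 1
        else:
            # Freelist empty: take the slowpath and switch to a new slab.
            self.stats["alloc_slow"] += 1
            self.current_slab = self.next_slab_id
            self.next_slab_id += 1
            self.freelist = [(self.current_slab, i)
                             for i in range(OBJECTS_PER_SLAB)]
        return self.freelist.pop()

    def free(self, obj):
        slab_id, _ = obj
        if slab_id == self.current_slab:
            # Object belongs to the current per-cpu slab: fastpath.
            self.stats["free_fast"] += 1
            self.freelist.append(obj)
        else:
            # Object belongs to some other slab: slowpath (not tracked further).
            self.stats["free_slow"] += 1


def alloc_then_free(n):
    """Test 1: n allocs followed by n frees."""
    cpu = ToyCpuSlab()
    objs = [cpu.alloc() for _ in range(n)]
    for obj in objs:
        cpu.free(obj)
    return cpu.stats


def alloc_free_pairs(n):
    """Test 2: one alloc immediately followed by one free, n times."""
    cpu = ToyCpuSlab()
    for _ in range(n):
        cpu.free(cpu.alloc())
    return cpu.stats


if __name__ == "__main__":
    print("alloc then free:", dict(alloc_then_free(10000)))
    print("alloc/free pairs:", dict(alloc_free_pairs(10000)))
```

Under these assumptions the first pattern takes the alloc slowpath once
per slab and the free slowpath for nearly every object, while the second
pattern stays on the fastpath after the very first alloc.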

2. Concurrent allocations

Allocations are performed concurrently on all cpus. The first test
performs a large number of allocs followed by a large number of frees and
the second (as under 1) follows each alloc with a free.

The remote free test frees all objects on a different processor than the
one where they were allocated.

For details on the tests, please look at today's posting of the source code
for the testing modules.

Results for kernel A
--------------------

Linux version 2.6.32-rc4-00027-gceb8d11 (gcc version 4.3.4 (Debian 4.3.4-5) ) #7 SMP Tue Oct 13 13:55:52 CDT 2009
SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=2
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 239 cycles kfree -> 261 cycles
10000 times kmalloc(16) -> 249 cycles kfree -> 208 cycles
10000 times kmalloc(32) -> 215 cycles kfree -> 232 cycles
10000 times kmalloc(64) -> 164 cycles kfree -> 216 cycles
10000 times kmalloc(128) -> 266 cycles kfree -> 275 cycles
10000 times kmalloc(256) -> 478 cycles kfree -> 199 cycles
10000 times kmalloc(512) -> 449 cycles kfree -> 201 cycles
10000 times kmalloc(1024) -> 484 cycles kfree -> 398 cycles
10000 times kmalloc(2048) -> 475 cycles kfree -> 559 cycles
10000 times kmalloc(4096) -> 792 cycles kfree -> 506 cycles
10000 times kmalloc(8192) -> 753 cycles kfree -> 679 cycles
10000 times kmalloc(16384) -> 968 cycles kfree -> 712 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 292 cycles
10000 times kmalloc(16)/kfree -> 308 cycles
10000 times kmalloc(32)/kfree -> 326 cycles
10000 times kmalloc(64)/kfree -> 303 cycles
10000 times kmalloc(128)/kfree -> 257 cycles
10000 times kmalloc(256)/kfree -> 262 cycles
10000 times kmalloc(512)/kfree -> 293 cycles
10000 times kmalloc(1024)/kfree -> 262 cycles
10000 times kmalloc(2048)/kfree -> 289 cycles
10000 times kmalloc(4096)/kfree -> 274 cycles
10000 times kmalloc(8192)/kfree -> 265 cycles
10000 times kmalloc(16384)/kfree -> 1041 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=172/168 1=173/176 2=173/169 3=170/165 4=167/166 5=172/168 6=173/167 7=170/172 8=172/166 9=171/171 10=171/171 11=169/166 12=169/167 13=172/168 14=171/169 15=171/166 Average=171/168
Kmalloc N*alloc N*free(16): 0=185/175 1=181/176 2=187/174 3=183/171 4=186/177 5=183/171 6=187/174 7=181/173 8=184/175 9=181/174 10=184/173 11=181/175 12=185/178 13=182/175 14=184/173 15=180/170 Average=183/174
Kmalloc N*alloc N*free(32): 0=201/185 1=205/189 2=200/183 3=202/178 4=198/180 5=202/177 6=201/183 7=201/181 8=201/185 9=200/185 10=199/182 11=200/177 12=199/183 13=204/177 14=199/184 15=203/178 Average=201/182
Kmalloc N*alloc N*free(64): 0=239/216 1=234/196 2=243/214 3=244/197 4=241/216 5=241/204 6=240/213 7=235/198 8=241/217 9=237/192 10=240/213 11=243/198 12=243/219 13=242/205 14=243/215 15=236/195 Average=240/207
Kmalloc N*alloc N*free(128): 0=405/342 1=346/303 2=402/346 3=346/303 4=403/353 5=344/306 6=401/340 7=346/314 8=403/348 9=344/306 10=398/342 11=344/309 12=407/337 13=347/312 14=402/349 15=344/302 Average=374/326
Kmalloc N*alloc N*free(256): 0=607/594 1=444/455 2=490/588 3=440/461 4=494/577 5=447/454 6=497/585 7=444/446 8=599/587 9=444/454 10=491/585 11=444/454 12=490/584 13=443/446 14=494/586 15=445/457 Average=482/520
Kmalloc N*alloc N*free(512): 0=419/683 1=419/428 2=419/561 3=420/435 4=422/566 5=433/448 6=423/566 7=432/445 8=424/670 9=430/448 10=426/565 11=428/451 12=429/574 13=438/472 14=430/576 15=440/468 Average=427/522
Kmalloc N*alloc N*free(1024): 0=399/377 1=381/373 2=399/373 3=383/374 4=399/377 5=381/378 6=399/377 7=382/372 8=397/376 9=382/376 10=398/375 11=384/374 12=400/375 13=379/375 14=400/374 15=384/374 Average=390/375
Kmalloc N*alloc N*free(2048): 0=713/446 1=514/444 2=600/446 3=512/445 4=599/449 5=512/440 6=605/446 7=510/441 8=704/446 9=511/441 10=601/443 11=512/442 12=598/449 13=512/441 14=605/445 15=511/440 Average=570/444
Kmalloc N*alloc N*free(4096): 0=972/1487 1=810/753 2=942/1308 3=808/758 4=944/1306 5=806/762 6=940/1309 7=807/753 8=968/1469 9=811/756 10=939/1305 11=807/757 12=943/1305 13=807/758 14=942/1307 15=812/758 Average=879/1053
Kmalloc N*(alloc free)(8): 0=252 1=251 2=254 3=252 4=251 5=251 6=252 7=252 8=252 9=251 10=254 11=252 12=251 13=251 14=252 15=252 Average=252
Kmalloc N*(alloc free)(16): 0=251 1=251 2=250 3=251 4=252 5=251 6=252 7=249 8=250 9=251 10=250 11=251 12=252 13=252 14=252 15=250 Average=251
Kmalloc N*(alloc free)(32): 0=252 1=254 2=250 3=255 4=251 5=254 6=250 7=251 8=251 9=251 10=250 11=254 12=251 13=253 14=250 15=254 Average=252
Kmalloc N*(alloc free)(64): 0=252 1=261 2=253 3=263 4=253 5=264 6=253 7=263 8=253 9=261 10=254 11=262 12=252 13=263 14=252 15=262 Average=258
Kmalloc N*(alloc free)(128): 0=252 1=261 2=250 3=250 4=253 5=265 6=252 7=263 8=252 9=261 10=250 11=250 12=253 13=264 14=251 15=263 Average=256
Kmalloc N*(alloc free)(256): 0=251 1=249 2=251 3=251 4=248 5=249 6=248 7=249 8=250 9=248 10=248 11=263 12=248 13=249 14=247 15=250 Average=250
Kmalloc N*(alloc free)(512): 0=250 1=251 2=245 3=250 4=250 5=252 6=250 7=250 8=249 9=250 10=245 11=250 12=250 13=253 14=250 15=251 Average=250
Kmalloc N*(alloc free)(1024): 0=254 1=250 2=250 3=247 4=251 5=248 6=252 7=248 8=253 9=251 10=250 11=247 12=250 13=249 14=250 15=248 Average=250
Kmalloc N*(alloc free)(2048): 0=250 1=256 2=250 3=254 4=272 5=253 6=253 7=251 8=249 9=254 10=250 11=267 12=272 13=252 14=254 15=254 Average=256
Kmalloc N*(alloc free)(4096): 0=248 1=250 2=250 3=250 4=248 5=250 6=250 7=263 8=247 9=249 10=250 11=248 12=248 13=250 14=250 15=259 Average=251
Remote free test
================
N*remote free(8): 0=5/3647 1=174/0 2=172/0 3=171/0 4=177/0 5=176/0 6=175/0 7=176/0 8=112/0 9=175/0 10=175/0 11=175/0 12=176/0 13=175/0 14=176/0 15=175/0 Average=160/228
N*remote free(16): 0=5/2805 1=188/0 2=188/0 3=187/0 4=189/0 5=187/0 6=189/0 7=186/0 8=121/0 9=186/0 10=188/0 11=186/0 12=187/0 13=187/0 14=187/0 15=187/0 Average=172/175
N*remote free(32): 0=4/3106 1=203/0 2=206/0 3=203/0 4=201/0 5=203/0 6=200/0 7=204/0 8=140/0 9=203/0 10=205/0 11=205/0 12=205/0 13=206/0 14=204/0 15=206/0 Average=187/194
N*remote free(64): 0=4/3595 1=262/0 2=264/0 3=259/0 4=263/0 5=259/0 6=260/0 7=258/0 8=190/0 9=255/0 10=261/0 11=259/0 12=259/0 13=254/0 14=255/0 15=257/0 Average=239/224
N*remote free(128): 0=4/5423 1=368/0 2=390/0 3=361/0 4=400/0 5=376/0 6=390/0 7=362/0 8=315/0 9=369/0 10=394/0 11=364/0 12=399/0 13=373/0 14=394/0 15=364/0 Average=351/339
N*remote free(256): 0=3/9422 1=435/0 2=459/0 3=426/0 4=453/0 5=431/0 6=455/0 7=429/0 8=374/0 9=434/0 10=459/0 11=425/0 12=459/0 13=436/0 14=458/0 15=434/0 Average=411/588
N*remote free(512): 0=4/8615 1=427/0 2=418/0 3=431/0 4=425/0 5=438/0 6=424/0 7=438/0 8=382/0 9=432/0 10=428/0 11=434/0 12=429/0 13=442/0 14=427/0 15=444/0 Average=401/538
N*remote free(1024): 0=4/9794 1=411/0 2=399/0 3=409/0 4=401/0 5=404/0 6=398/0 7=411/0 8=351/0 9=410/0 10=400/0 11=409/0 12=401/0 13=407/0 14=402/0 15=409/0 Average=377/612
N*remote free(2048): 0=4/10466 1=532/0 2=606/0 3=532/0 4=606/0 5=536/0 6=602/0 7=536/0 8=532/0 9=533/0 10=605/0 11=532/0 12=604/0 13=534/0 14=602/0 15=535/0 Average=527/654
N*remote free(4096): 0=4/12602 1=839/0 2=931/0 3=832/0 4=926/0 5=834/0 6=932/0 7=834/0 8=827/0 9=841/0 10=933/0 11=835/0 12=929/0 13=834/0 14=937/0 15=839/0 Average=819/787
1 alloc N free test
===================
1 alloc N free(8): 0=3596 1=940 2=942 3=955 4=934 5=966 6=934 7=969 8=953 9=964 10=934 11=947 12=937 13=966 14=941 15=969 Average=1115
1 alloc N free(16): 0=4365 1=1078 2=1065 3=1068 4=1061 5=1068 6=1059 7=1064 8=1082 9=1082 10=1067 11=1073 12=1064 13=1067 14=1058 15=1063 Average=1274
1 alloc N free(32): 0=4193 1=1001 2=1004 3=1010 4=1005 5=1006 6=1007 7=1010 8=1009 9=1002 10=1001 11=1006 12=1008 13=1001 14=1006 15=1010 Average=1205
1 alloc N free(64): 0=4961 1=1209 2=1209 3=1208 4=1205 5=1209 6=1206 7=1207 8=1208 9=1206 10=1207 11=1206 12=1205 13=1206 14=1207 15=1208 Average=1442
1 alloc N free(128): 0=7100 1=1413 2=1413 3=1412 4=1416 5=1414 6=1412 7=1412 8=1413 9=1413 10=1412 11=1414 12=1412 13=1414 14=1413 15=1412 Average=1768
1 alloc N free(256): 0=9157 1=1321 2=1318 3=1318 4=1319 5=1321 6=1320 7=1319 8=1321 9=1320 10=1319 11=1317 12=1319 13=1320 14=1320 15=1319 Average=1809
1 alloc N free(512): 0=9415 1=826 2=824 3=823 4=824 5=823 6=824 7=829 8=828 9=826 10=827 11=826 12=826 13=825 14=825 15=824 Average=1362
1 alloc N free(1024): 0=8331 1=847 2=849 3=847 4=848 5=847 6=848 7=847 8=847 9=848 10=848 11=846 12=847 13=847 14=846 15=846 Average=1315
1 alloc N free(2048): 0=9732 1=858 2=858 3=859 4=858 5=859 6=858 7=858 8=857 9=858 10=858 11=857 12=858 13=858 14=857 15=857 Average=1413
1 alloc N free(4096): 0=12370 1=944 2=944 3=944 4=944 5=944 6=944 7=941 8=943 9=943 10=944 11=942 12=943 13=943 14=943 15=944 Average=1658
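
For anyone who wants to post-process the per-cpu lines above, the
following is a minimal parsing sketch. It assumes only the line layout
visible in this output ("label: cpu=alloc/free ... Average=alloc/free",
or a single value per cpu). The function names are made up for
illustration, and how the test module itself rounds the average is not
assumed here.

```python
import re

# Matches "cpu=alloc/free" (e.g. "0=172/168") or "cpu=cycles" (e.g. "0=252").
PAIR = re.compile(r"\b(\d+)=(\d+)(?:/(\d+))?")


def parse_per_cpu(line):
    """Split one result line into (label, {cpu: values}, reported average)."""
    label, _, rest = line.partition(": ")
    values, _, avg = rest.partition(" Average=")
    per_cpu = {
        int(cpu): tuple(int(x) for x in (first, second) if x)
        for cpu, first, second in PAIR.findall(values)
    }
    reported = tuple(int(x) for x in avg.strip().split("/"))
    return label, per_cpu, reported


def column_means(per_cpu):
    """Mean of each column (alloc, free) over all cpus, as floats."""
    columns = zip(*per_cpu.values())
    return [sum(col) / len(col) for col in columns]


if __name__ == "__main__":
    sample = ("Kmalloc N*(alloc free)(8): 0=252 1=251 2=254 3=252 "
              "Average=252")  # shortened sample line in the same layout
    label, per_cpu, reported = parse_per_cpu(sample)
    print(label, column_means(per_cpu), reported)
```

Recomputing the means this way makes it easy to spot a single cpu that
dominates an average, as cpu 0 does in the remote free lines.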

Results for kernel B (this_cpu_xx optimized fastpath):
------------------------------------------------------

Linux version 2.6.32-rc4-00027-gceb8d11-dirty (gcc version 4.3.4 (Debian 4.3.4-5) ) #6 SMP Tue Oct 13 13:44:47 CDT 2009
SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=2
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 134 cycles kfree -> 212 cycles
10000 times kmalloc(16) -> 109 cycles kfree -> 116 cycles
10000 times kmalloc(32) -> 157 cycles kfree -> 231 cycles
10000 times kmalloc(64) -> 168 cycles kfree -> 169 cycles
10000 times kmalloc(128) -> 263 cycles kfree -> 260 cycles
10000 times kmalloc(256) -> 430 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 415 cycles kfree -> 258 cycles
10000 times kmalloc(1024) -> 406 cycles kfree -> 432 cycles
10000 times kmalloc(2048) -> 457 cycles kfree -> 579 cycles
10000 times kmalloc(4096) -> 624 cycles kfree -> 553 cycles
10000 times kmalloc(8192) -> 851 cycles kfree -> 851 cycles
10000 times kmalloc(16384) -> 907 cycles kfree -> 722 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 232 cycles
10000 times kmalloc(16)/kfree -> 150 cycles
10000 times kmalloc(32)/kfree -> 278 cycles
10000 times kmalloc(64)/kfree -> 263 cycles
10000 times kmalloc(128)/kfree -> 280 cycles
10000 times kmalloc(256)/kfree -> 279 cycles
10000 times kmalloc(512)/kfree -> 299 cycles
10000 times kmalloc(1024)/kfree -> 289 cycles
10000 times kmalloc(2048)/kfree -> 288 cycles
10000 times kmalloc(4096)/kfree -> 321 cycles
10000 times kmalloc(8192)/kfree -> 285 cycles
10000 times kmalloc(16384)/kfree -> 1002 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=174/191 1=172/180 2=173/191 3=176/179 4=172/190 5=172/182 6=172/190 7=173/182 8=172/191 9=173/191 10=172/191 11=173/191 12=175/190 13=173/183 14=173/191 15=175/183 Average=173/187
Kmalloc N*alloc N*free(16): 0=181/190 1=184/194 2=183/189 3=186/189 4=185/189 5=185/190 6=184/190 7=187/188 8=179/189 9=184/190 10=182/189 11=182/192 12=184/190 13=181/188 14=183/189 15=184/190 Average=183/190
Kmalloc N*alloc N*free(32): 0=195/345 1=179/242 2=201/270 3=181/239 4=201/270 5=183/241 6=199/270 7=182/240 8=196/283 9=185/237 10=198/270 11=180/238 12=201/271 13=181/240 14=200/272 15=181/239 Average=190/260
Kmalloc N*alloc N*free(64): 0=217/450 1=216/362 2=219/453 3=213/355 4=220/449 5=210/361 6=224/448 7=213/359 8=222/452 9=216/358 10=220/454 11=211/357 12=220/450 13=213/362 14=225/451 15=216/360 Average=217/405
Kmalloc N*alloc N*free(128): 0=421/688 1=348/440 2=423/593 3=356/421 4=419/587 5=355/438 6=418/590 7=345/431 8=418/675 9=353/424 10=421/587 11=355/440 12=419/589 13=356/446 14=421/577 15=356/437 Average=386/523
Kmalloc N*alloc N*free(256): 0=478/880 1=464/675 2=476/847 3=471/673 4=473/845 5=463/679 6=473/841 7=466/676 8=479/871 9=467/669 10=476/848 11=473/674 12=473/845 13=465/664 14=471/847 15=465/666 Average=471/763
Kmalloc N*alloc N*free(512): 0=448/628 1=454/550 2=450/574 3=455/541 4=446/576 5=452/557 6=447/575 7=454/547 8=445/591 9=453/555 10=446/577 11=457/542 12=446/573 13=454/550 14=447/572 15=455/553 Average=450/566
Kmalloc N*alloc N*free(1024): 0=569/707 1=501/624 2=542/694 3=501/624 4=533/695 5=489/624 6=544/695 7=502/617 8=550/705 9=501/624 10=543/693 11=500/617 12=534/695 13=489/619 14=544/693 15=502/619 Average=521/659
Kmalloc N*alloc N*free(2048): 0=466/1246 1=474/856 2=465/1151 3=473/866 4=465/1169 5=474/860 6=466/1170 7=475/838 8=466/1240 9=474/852 10=466/1153 11=475/855 12=467/1154 13=475/851 14=467/1151 15=475/844 Average=470/1016
Kmalloc N*alloc N*free(4096): 0=841/794 1=790/778 2=839/796 3=789/781 4=838/795 5=790/777 6=843/798 7=787/777 8=841/795 9=789/781 10=839/798 11=792/777 12=838/800 13=791/776 14=840/801 15=788/781 Average=815/788
Kmalloc N*(alloc free)(8): 0=245 1=244 2=242 3=261 4=247 5=247 6=243 7=246 8=244 9=243 10=242 11=261 12=247 13=248 14=244 15=245 Average=247
Kmalloc N*(alloc free)(16): 0=248 1=247 2=248 3=243 4=247 5=247 6=242 7=256 8=247 9=246 10=247 11=242 12=247 13=247 14=242 15=257 Average=247
Kmalloc N*(alloc free)(32): 0=243 1=260 2=254 3=243 4=243 5=242 6=247 7=264 8=242 9=259 10=253 11=243 12=243 13=242 14=247 15=265 Average=250
Kmalloc N*(alloc free)(64): 0=244 1=248 2=251 3=244 4=248 5=249 6=247 7=247 8=243 9=247 10=251 11=244 12=248 13=249 14=247 15=248 Average=247
Kmalloc N*(alloc free)(128): 0=253 1=259 2=257 3=261 4=252 5=257 6=253 7=256 8=252 9=256 10=256 11=259 12=252 13=257 14=252 15=256 Average=255
Kmalloc N*(alloc free)(256): 0=241 1=241 2=244 3=241 4=250 5=250 6=244 7=246 8=239 9=240 10=241 11=240 12=250 13=250 14=243 15=247 Average=244
Kmalloc N*(alloc free)(512): 0=247 1=245 2=241 3=255 4=245 5=256 6=242 7=253 8=296 9=244 10=240 11=255 12=245 13=256 14=242 15=250 Average=251
Kmalloc N*(alloc free)(1024): 0=259 1=255 2=247 3=254 4=245 5=244 6=248 7=248 8=256 9=254 10=247 11=254 12=245 13=245 14=249 15=249 Average=250
Kmalloc N*(alloc free)(2048): 0=248 1=248 2=243 3=243 4=251 5=259 6=251 7=248 8=248 9=249 10=244 11=244 12=250 13=246 14=250 15=247 Average=248
Kmalloc N*(alloc free)(4096): 0=243 1=243 2=259 3=244 4=243 5=244 6=244 7=244 8=242 9=243 10=246 11=245 12=243 13=245 14=244 15=244 Average=245
Remote free test
================
N*remote free(8): 0=5/3085 1=174/0 2=173/0 3=173/0 4=173/0 5=173/0 6=173/0 7=174/0 8=105/0 9=174/0 10=173/0 11=174/0 12=174/0 13=174/0 14=174/0 15=175/0 Average=159/192
N*remote free(16): 0=5/3341 1=185/0 2=184/0 3=185/0 4=185/0 5=186/0 6=183/0 7=185/0 8=114/0 9=185/0 10=184/0 11=185/0 12=186/0 13=188/0 14=185/0 15=187/0 Average=170/208
N*remote free(32): 0=4/2829 1=187/0 2=207/0 3=182/0 4=201/0 5=186/0 6=207/0 7=184/0 8=127/0 9=188/0 10=205/0 11=186/0 12=204/0 13=189/0 14=209/0 15=188/0 Average=178/176
N*remote free(64): 0=4/3535 1=233/0 2=238/0 3=226/0 4=239/0 5=230/0 6=233/0 7=232/0 8=174/0 9=228/0 10=237/0 11=223/0 12=239/0 13=228/0 14=233/0 15=230/0 Average=214/221
N*remote free(128): 0=3/4747 1=366/0 2=419/0 3=372/0 4=414/0 5=372/0 6=417/0 7=378/0 8=336/0 9=373/0 10=411/0 11=377/0 12=415/0 13=379/0 14=423/0 15=381/0 Average=365/296
N*remote free(256): 0=4/9083 1=456/0 2=443/0 3=461/0 4=441/0 5=460/0 6=446/0 7=456/0 8=392/0 9=453/0 10=446/0 11=458/0 12=441/0 13=460/0 14=446/0 15=455/0 Average=420/567
N*remote free(512): 0=4/9468 1=445/0 2=427/0 3=446/0 4=436/0 5=447/0 6=430/0 7=444/0 8=384/0 9=445/0 10=430/0 11=446/0 12=439/0 13=445/0 14=430/0 15=443/0 Average=409/591
N*remote free(1024): 0=3/10387 1=498/0 2=533/0 3=506/0 4=531/0 5=509/0 6=540/0 7=511/0 8=476/0 9=497/0 10=532/0 11=508/0 12=531/0 13=508/0 14=541/0 15=510/0 Average=483/649
N*remote free(2048): 0=4/10294 1=489/0 2=468/0 3=487/0 4=470/0 5=490/0 6=466/0 7=487/0 8=405/0 9=486/0 10=467/0 11=487/0 12=468/0 13=488/0 14=467/0 15=489/0 Average=445/643
N*remote free(4096): 0=4/12687 1=821/0 2=835/0 3=823/0 4=834/0 5=820/0 6=833/0 7=819/0 8=750/0 9=822/0 10=835/0 11=819/0 12=833/0 13=818/0 14=829/0 15=819/0 Average=770/793
1 alloc N free test
===================
1 alloc N free(8): 0=3949 1=1060 2=1046 3=1068 4=1049 5=1047 6=1049 7=1037 8=1070 9=1046 10=1044 11=1066 12=1048 13=1048 14=1051 15=1055 Average=1233
1 alloc N free(16): 0=3703 1=1153 2=1155 3=1154 4=1154 5=1150 6=1155 7=1150 8=1159 9=1154 10=1154 11=1154 12=1153 13=1149 14=1154 15=1150 Average=1313
1 alloc N free(32): 0=4098 1=997 2=999 3=1004 4=1001 5=996 6=993 7=1003 8=1003 9=1000 10=997 11=1003 12=1003 13=996 14=993 15=1001 Average=1193
1 alloc N free(64): 0=4567 1=1018 2=1020 3=1021 4=1020 5=1019 6=1016 7=1011 8=1022 9=1022 10=1019 11=1021 12=1019 13=1021 14=1020 15=1010 Average=1240
1 alloc N free(128): 0=6814 1=1345 2=1346 3=1343 4=1342 5=1345 6=1343 7=1345 8=1345 9=1344 10=1345 11=1343 12=1342 13=1344 14=1344 15=1344 Average=1686
1 alloc N free(256): 0=9469 1=946 2=945 3=945 4=944 5=944 6=945 7=941 8=943 9=943 10=942 11=945 12=943 13=945 14=941 15=944 Average=1477
1 alloc N free(512): 0=8600 1=1278 2=1280 3=1277 4=1278 5=1279 6=1277 7=1277 8=1279 9=1277 10=1279 11=1281 12=1280 13=1280 14=1279 15=1280 Average=1736
1 alloc N free(1024): 0=9485 1=844 2=844 3=842 4=841 5=841 6=841 7=842 8=841 9=842 10=843 11=843 12=842 13=842 14=842 15=843 Average=1382
1 alloc N free(2048): 0=10836 1=868 2=867 3=868 4=868 5=867 6=867 7=867 8=868 9=867 10=867 11=867 12=867 13=867 14=867 15=867 Average=1490
1 alloc N free(4096): 0=12653 1=930 2=929 3=929 4=928 5=927 6=928 7=927 8=928 9=929 10=928 11=930 12=928 13=930 14=928 15=929 Average=1661

Results for kernel C (Irqless fastpath):
---------------------------------------

Linux version 2.6.32-rc4-00027-gceb8d11-dirty (gcc version 4.3.4 (Debian 4.3.4-5) ) #8 SMP Tue Oct 13 14:14:05 CDT 2009
SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=16, Nodes=2
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 251 cycles
10000 times kmalloc(16) -> 201 cycles kfree -> 261 cycles
10000 times kmalloc(32) -> 220 cycles kfree -> 261 cycles
10000 times kmalloc(64) -> 186 cycles kfree -> 224 cycles
10000 times kmalloc(128) -> 205 cycles kfree -> 125 cycles
10000 times kmalloc(256) -> 351 cycles kfree -> 267 cycles
10000 times kmalloc(512) -> 330 cycles kfree -> 310 cycles
10000 times kmalloc(1024) -> 416 cycles kfree -> 419 cycles
10000 times kmalloc(2048) -> 537 cycles kfree -> 439 cycles
10000 times kmalloc(4096) -> 458 cycles kfree -> 594 cycles
10000 times kmalloc(8192) -> 810 cycles kfree -> 678 cycles
10000 times kmalloc(16384) -> 879 cycles kfree -> 746 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 66 cycles
10000 times kmalloc(16)/kfree -> 187 cycles
10000 times kmalloc(32)/kfree -> 116 cycles
10000 times kmalloc(64)/kfree -> 107 cycles
10000 times kmalloc(128)/kfree -> 115 cycles
10000 times kmalloc(256)/kfree -> 65 cycles
10000 times kmalloc(512)/kfree -> 66 cycles
10000 times kmalloc(1024)/kfree -> 206 cycles
10000 times kmalloc(2048)/kfree -> 65 cycles
10000 times kmalloc(4096)/kfree -> 193 cycles
10000 times kmalloc(8192)/kfree -> 65 cycles
10000 times kmalloc(16384)/kfree -> 976 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=112/188 1=113/195 2=113/188 3=115/186 4=112/188 5=112/183 6=112/188 7=112/181 8=114/190 9=115/183 10=113/187 11=113/185 12=113/189 13=113/186 14=112/186 15=114/181 Average=113/187
Kmalloc N*alloc N*free(16): 0=124/196 1=125/205 2=123/196 3=127/199 4=124/195 5=124/198 6=123/196 7=125/207 8=124/194 9=124/208 10=123/198 11=126/199 12=125/196 13=125/199 14=125/198 15=126/202 Average=125/199
Kmalloc N*alloc N*free(32): 0=153/271 1=124/247 2=145/269 3=130/264 4=146/270 5=127/244 6=144/275 7=131/251 8=143/270 9=123/249 10=142/270 11=127/264 12=145/270 13=129/247 14=143/275 15=130/249 Average=136/262
Kmalloc N*alloc N*free(64): 0=172/615 1=169/370 2=181/493 3=170/388 4=179/494 5=169/417 6=177/495 7=169/391 8=176/504 9=167/369 10=178/494 11=168/381 12=178/493 13=168/431 14=178/494 15=170/394 Average=173/451
Kmalloc N*alloc N*free(128): 0=378/683 1=324/481 2=377/654 3=324/448 4=378/651 5=320/494 6=375/647 7=328/522 8=381/683 9=326/490 10=380/645 11=322/461 12=377/650 13=321/464 14=377/642 15=318/509 Average=350/570
Kmalloc N*alloc N*free(256): 0=441/906 1=424/670 2=436/837 3=428/658 4=435/839 5=425/669 6=439/839 7=427/671 8=435/893 9=425/669 10=434/832 11=425/663 12=434/835 13=422/661 14=437/824 15=424/652 Average=431/757
Kmalloc N*alloc N*free(512): 0=402/662 1=392/578 2=401/614 3=402/574 4=401/618 5=394/578 6=402/618 7=395/576 8=403/652 9=394/574 10=404/616 11=400/569 12=400/616 13=395/570 14=400/616 15=397/582 Average=399/601
Kmalloc N*alloc N*free(1024): 0=585/690 1=428/604 2=488/691 3=423/601 4=481/696 5=428/602 6=488/696 7=428/605 8=571/689 9=426/606 10=487/693 11=425/601 12=481/695 13=428/595 14=485/693 15=428/603 Average=467/647
Kmalloc N*alloc N*free(2048): 0=424/1273 1=437/834 2=422/1122 3=434/831 4=420/1122 5=439/837 6=421/1119 7=437/830 8=423/1259 9=436/822 10=424/1118 11=437/827 12=421/1120 13=436/841 14=423/1115 15=439/830 Average=430/994
Kmalloc N*alloc N*free(4096): 0=870/806 1=763/789 2=854/805 3=760/782 4=857/803 5=767/788 6=854/807 7=760/788 8=867/803 9=763/785 10=853/805 11=757/785 12=858/806 13=763/783 14=857/802 15=766/782 Average=811/795
Kmalloc N*(alloc free)(8): 0=139 1=138 2=138 3=140 4=139 5=139 6=138 7=140 8=139 9=138 10=137 11=140 12=140 13=140 14=138 15=141 Average=139
Kmalloc N*(alloc free)(16): 0=141 1=140 2=139 3=139 4=131 5=139 6=131 7=138 8=139 9=139 10=139 11=139 12=131 13=139 14=131 15=138 Average=137
Kmalloc N*(alloc free)(32): 0=132 1=140 2=131 3=139 4=139 5=138 6=138 7=140 8=132 9=140 10=132 11=140 12=139 13=139 14=139 15=140 Average=137
Kmalloc N*(alloc free)(64): 0=141 1=142 2=131 3=142 4=140 5=141 6=138 7=142 8=139 9=141 10=131 11=141 12=140 13=141 14=138 15=141 Average=139
Kmalloc N*(alloc free)(128): 0=140 1=139 2=132 3=138 4=139 5=139 6=138 7=139 8=140 9=139 10=132 11=139 12=139 13=140 14=138 15=140 Average=138
Kmalloc N*(alloc free)(256): 0=140 1=138 2=137 3=136 4=138 5=137 6=137 7=137 8=137 9=137 10=137 11=137 12=138 13=137 14=137 15=137 Average=137
Kmalloc N*(alloc free)(512): 0=137 1=136 2=138 3=138 4=137 5=135 6=136 7=136 8=137 9=135 10=137 11=137 12=137 13=146 14=137 15=137 Average=137
Kmalloc N*(alloc free)(1024): 0=138 1=138 2=139 3=138 4=135 5=137 6=137 7=137 8=137 9=137 10=138 11=137 12=146 13=137 14=137 15=137 Average=138
Kmalloc N*(alloc free)(2048): 0=136 1=136 2=135 3=137 4=136 5=137 6=136 7=137 8=137 9=136 10=144 11=138 12=145 13=138 14=136 15=138 Average=138
Kmalloc N*(alloc free)(4096): 0=136 1=136 2=137 3=137 4=137 5=137 6=138 7=136 8=147 9=135 10=137 11=137 12=137 13=137 14=138 15=137 Average=137
Remote free test
================
N*remote free(8): 0=5/3335 1=115/0 2=117/0 3=117/0 4=117/0 5=117/0 6=115/0 7=117/0 8=60/0 9=115/0 10=116/0 11=118/0 12=116/0 13=117/0 14=116/0 15=118/0 Average=106/208
N*remote free(16): 0=5/3944 1=126/0 2=123/0 3=127/0 4=125/0 5=127/0 6=126/0 7=127/0 8=68/0 9=125/0 10=124/0 11=126/0 12=126/0 13=128/0 14=127/0 15=127/0 Average=115/246
N*remote free(32): 0=4/3129 1=132/0 2=152/0 3=129/0 4=153/0 5=128/0 6=151/0 7=132/0 8=88/0 9=133/0 10=154/0 11=130/0 12=155/0 13=131/0 14=154/0 15=137/0 Average=129/195
N*remote free(64): 0=4/3313 1=197/0 2=204/0 3=196/0 4=194/0 5=200/0 6=196/0 7=189/0 8=143/0 9=194/0 10=201/0 11=186/0 12=198/0 13=190/0 14=192/0 15=189/0 Average=180/207
N*remote free(128): 0=3/4289 1=343/0 2=377/0 3=342/0 4=381/0 5=344/0 6=385/0 7=340/0 8=314/0 9=345/0 10=378/0 11=342/0 12=378/0 13=343/0 14=375/0 15=346/0 Average=334/268
N*remote free(256): 0=4/9425 1=423/0 2=408/0 3=419/0 4=407/0 5=419/0 6=405/0 7=420/0 8=352/0 9=423/0 10=409/0 11=422/0 12=409/0 13=418/0 14=405/0 15=419/0 Average=385/589
N*remote free(512): 0=4/9517 1=386/0 2=383/0 3=390/0 4=386/0 5=391/0 6=383/0 7=387/0 8=345/0 9=389/0 10=381/0 11=391/0 12=386/0 13=388/0 14=384/0 15=390/0 Average=360/594
N*remote free(1024): 0=3/10053 1=451/0 2=490/0 3=446/0 4=490/0 5=450/0 6=492/0 7=452/0 8=448/0 9=452/0 10=492/0 11=447/0 12=491/0 13=454/0 14=490/0 15=453/0 Average=438/628
N*remote free(2048): 0=4/11238 1=454/0 2=415/0 3=454/0 4=415/0 5=455/0 6=416/0 7=457/0 8=375/0 9=454/0 10=416/0 11=454/0 12=414/0 13=455/0 14=415/0 15=458/0 Average=407/702
N*remote free(4096): 0=3/10262 1=807/0 2=845/0 3=803/0 4=832/0 5=806/0 6=838/0 7=810/0 8=760/0 9=800/0 10=840/0 11=805/0 12=836/0 13=802/0 14=837/0 15=806/0 Average=764/641
1 alloc N free test
===================
1 alloc N free(8): 0=2119 1=606 2=611 3=593 4=603 5=580 6=592 7=587 8=617 9=607 10=607 11=588 12=608 13=578 14=570 15=603 Average=692
1 alloc N free(16): 0=3315 1=1177 2=1178 3=1175 4=1176 5=1177 6=1179 7=1177 8=1184 9=1178 10=1178 11=1175 12=1178 13=1177 14=1177 15=1175 Average=1311
1 alloc N free(32): 0=3005 1=952 2=946 3=954 4=948 5=952 6=954 7=944 8=956 9=955 10=945 11=955 12=947 13=946 14=954 15=947 Average=1079
1 alloc N free(64): 0=3534 1=1013 2=1013 3=1011 4=1013 5=1009 6=1009 7=1010 8=1014 9=1013 10=1012 11=1010 12=1012 13=1009 14=1008 15=1008 Average=1169
1 alloc N free(128): 0=6786 1=1406 2=1404 3=1408 4=1405 5=1404 6=1405 7=1404 8=1406 9=1404 10=1406 11=1407 12=1404 13=1407 14=1403 15=1405 Average=1742
1 alloc N free(256): 0=7496 1=1266 2=1269 3=1266 4=1269 5=1268 6=1266 7=1267 8=1266 9=1267 10=1268 11=1266 12=1269 13=1268 14=1267 15=1267 Average=1657
1 alloc N free(512): 0=6893 1=847 2=846 3=848 4=846 5=848 6=847 7=848 8=847 9=847 10=847 11=848 12=846 13=847 14=846 15=846 Average=1225
1 alloc N free(1024): 0=9241 1=839 2=841 3=839 4=838 5=838 6=838 7=835 8=837 9=837 10=838 11=839 12=837 13=839 14=837 15=838 Average=1363
1 alloc N free(2048): 0=8790 1=854 2=854 3=853 4=854 5=855 6=853 7=854 8=854 9=854 10=853 11=853 12=854 13=853 14=852 15=853 Average=1350
1 alloc N free(4096): 0=9548 1=922 2=924 3=924 4=924 5=924 6=923 7=921 8=923 9=923 10=925 11=922 12=924 13=922 14=923 15=924 Average=1462
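
To put the three kernels side by side, the sketch below computes the
relative change in the concurrent "N*(alloc free)" averages of kernels B
and C against kernel A. The numbers are copied from the Average= fields
of the three result sets above. The helper names are made up, and this
is plain arithmetic on the posted averages, not a statistical analysis.

```python
SIZES = [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

# "Kmalloc N*(alloc free)(size)" Average= values, in cycles, per kernel.
AVERAGES = {
    "A": [252, 251, 252, 258, 256, 250, 250, 250, 256, 251],
    "B": [247, 247, 250, 247, 255, 244, 251, 250, 248, 245],
    "C": [139, 137, 137, 139, 138, 137, 137, 138, 138, 137],
}


def relative_change(base, other):
    """Per-size change of `other` relative to `base` (negative = fewer cycles)."""
    return [(o - b) / b for b, o in zip(base, other)]


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    for kernel in ("B", "C"):
        changes = relative_change(AVERAGES["A"], AVERAGES[kernel])
        for size, change in zip(SIZES, changes):
            print(f"kernel {kernel} size {size:5d}: {change:+.1%}")
        print(f"kernel {kernel} mean change: {mean(changes):+.1%}")
```

The same helper can be pointed at any of the other Average= columns
(N*alloc N*free, remote free, 1 alloc N free) by swapping in those
values.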

* Re: this_cpu_xx's patchset effect on SLUB cycle counts
  2009-10-13 20:20 this_cpu_xx's patchset effect on SLUB cycle counts Christoph Lameter
@ 2009-10-14  3:12 ` David Rientjes
  2009-10-14 15:31   ` Christoph Lameter
  0 siblings, 1 reply; 3+ messages in thread
From: David Rientjes @ 2009-10-14  3:12 UTC
  To: Christoph Lameter
  Cc: Mel Gorman, linux-kernel, Pekka Enberg, Tejun Heo, Mathieu Desnoyers

On Tue, 13 Oct 2009, Christoph Lameter wrote:

> 
> The recent this_cpu_xx patchsets have made the allocation fastpath in SLUB
> more effective by avoiding lookups and the disabling of interrupts. The
> approaches can likely also be applied to other allocators.
> 
> Measurements were done using the in-kernel page allocator benchmarks that
> were also posted today. I hope that these numbers can lead to an
> evaluation of how useful the this_cpu_xx operations are and how to most
> effectively apply them in the kernel.
> 
> The following kernels were run:
> 
> A. Upstream with Tejun's for-next tree (this includes the this_cpu_xx base
> functionality but not the enhancements to the page allocator or the rework
> of SLUB's fastpath).
> 
> B. Kernel A with the page allocator and slub enhancements (including the
> one titled "aggressive use of this_cpu_xx").
> 
> C. Kernel B with the slub irqless patch on top.
> 
> Note that B and C improve only the fastpath of the SLUB allocator. They do
> not affect the slowpath or the page allocator fallback. Well, that is not
> entirely true: C in particular adds code to the slowpath. The question is
> whether that offsets the gains in the fastpath.
> 

I benchmarked this patchset both with and without the irqless patch from 
http://marc.info/?l=linux-kernel&m=125503037213262 on several of my 
machines.  The results were good for the most part, but I found a very 
reproducible regression on my 4-core 8G Xeon 5500 with HyperThreading for 
objects of smaller size (8, 16, and 64 bytes) without the irqless patch:

This is your baseline "Kernel A" (percpu#for-next at dec54bf "this_cpu: 
Use this_cpu_xx in trace_functions_graph.c"):

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 241 cycles kfree -> 283 cycles
10000 times kmalloc(16) -> 236 cycles kfree -> 291 cycles
10000 times kmalloc(32) -> 259 cycles kfree -> 300 cycles
10000 times kmalloc(64) -> 350 cycles kfree -> 331 cycles
10000 times kmalloc(128) -> 468 cycles kfree -> 373 cycles
10000 times kmalloc(256) -> 594 cycles kfree -> 525 cycles
10000 times kmalloc(512) -> 647 cycles kfree -> 560 cycles
10000 times kmalloc(1024) -> 707 cycles kfree -> 576 cycles
10000 times kmalloc(2048) -> 688 cycles kfree -> 594 cycles
10000 times kmalloc(4096) -> 860 cycles kfree -> 749 cycles
10000 times kmalloc(8192) -> 1219 cycles kfree -> 1054 cycles
10000 times kmalloc(16384) -> 1440 cycles kfree -> 1403 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 376 cycles
10000 times kmalloc(16)/kfree -> 370 cycles
10000 times kmalloc(32)/kfree -> 366 cycles
10000 times kmalloc(64)/kfree -> 368 cycles
10000 times kmalloc(128)/kfree -> 371 cycles
10000 times kmalloc(256)/kfree -> 380 cycles
10000 times kmalloc(512)/kfree -> 416 cycles
10000 times kmalloc(1024)/kfree -> 383 cycles
10000 times kmalloc(2048)/kfree -> 377 cycles
10000 times kmalloc(4096)/kfree -> 379 cycles
10000 times kmalloc(8192)/kfree -> 381 cycles
10000 times kmalloc(16384)/kfree -> 1697 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=459/460 1=458/465 2=3239/395 3=3237/431 Average=1848/438
Kmalloc N*alloc N*free(16): 0=820/506 1=821/516 2=1769/510 3=1769/512 Average=1295/511
Kmalloc N*alloc N*free(32): 0=545/520 1=548/534 2=850/527 3=848/535 Average=698/529
Kmalloc N*alloc N*free(64): 0=1105/753 1=1095/761 2=1020/773 3=1018/748 Average=1059/759
Kmalloc N*alloc N*free(128): 0=1217/1313 1=1224/1304 2=1171/1294 3=1171/1285 Average=1196/1299
Kmalloc N*alloc N*free(256): 0=1683/1976 1=1682/2008 2=1717/2001 3=1695/1974 Average=1694/1990
Kmalloc N*alloc N*free(512): 0=1813/2225 1=1826/2227 2=1822/2308 3=1823/2310 Average=1821/2267
Kmalloc N*alloc N*free(1024): 0=1767/2727 1=1767/2707 2=1871/2617 3=1871/2644 Average=1819/2674
Kmalloc N*alloc N*free(2048): 0=2416/2954 1=2416/2959 2=2376/2748 3=2382/2765 Average=2398/2857
Kmalloc N*alloc N*free(4096): 0=3263/3955 1=3274/3958 2=3280/3636 3=3269/3627 Average=3272/3794
Kmalloc N*(alloc free)(8): 0=576 1=574 2=582 3=582 Average=579
Kmalloc N*(alloc free)(16): 0=439 1=439 2=582 3=582 Average=511
Kmalloc N*(alloc free)(32): 0=597 1=596 2=593 3=593 Average=595
Kmalloc N*(alloc free)(64): 0=574 1=576 2=583 3=583 Average=579
Kmalloc N*(alloc free)(128): 0=595 1=595 2=597 3=597 Average=596
Kmalloc N*(alloc free)(256): 0=580 1=579 2=582 3=582 Average=581
Kmalloc N*(alloc free)(512): 0=586 1=588 2=576 3=576 Average=581
Kmalloc N*(alloc free)(1024): 0=625 1=625 2=614 3=614 Average=620
Kmalloc N*(alloc free)(2048): 0=570 1=571 2=584 3=582 Average=577
Kmalloc N*(alloc free)(4096): 0=585 1=584 2=577 3=577 Average=581
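
For anyone post-processing these "Concurrent allocs" lines, here is a
minimal parsing sketch. The function names are hypothetical and not part
of the benchmark. It assumes the "Average=" field is the per-CPU mean
rounded half up, which matches the lines checked by hand here but is not
confirmed against the benchmark source.

```python
import re

def parse_concurrent_line(line):
    """Split one 'Concurrent allocs' line into (label, per_cpu, reported).

    Each value is a tuple of ints: (alloc, free) cycles for the
    N*alloc N*free test, or a 1-tuple for the N*(alloc free) test.
    """
    label, _, rest = line.partition(": ")
    per_cpu = {}
    reported = None
    for key, value in re.findall(r"(\w+)=([\d/]+)", rest):
        cycles = tuple(int(v) for v in value.split("/"))
        if key == "Average":
            reported = cycles
        else:
            per_cpu[int(key)] = cycles
    return label, per_cpu, reported

def mean_per_cpu(per_cpu):
    """Mean of each column across CPUs, rounded half up (an assumption)."""
    n = len(per_cpu)
    return tuple((2 * sum(col) + n) // (2 * n)
                 for col in zip(*per_cpu.values()))
```

Comparing mean_per_cpu() against the reported average is a cheap way to
catch copy-and-paste damage in a results line before comparing kernels.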

And this is your "Kernel B" (with v6 of your patchset plus the two fixes 
to dma_kmalloc_cache() and kmem_cache_open()):

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 323 cycles kfree -> 371 cycles
10000 times kmalloc(16) -> 289 cycles kfree -> 288 cycles
10000 times kmalloc(32) -> 254 cycles kfree -> 397 cycles
10000 times kmalloc(64) -> 398 cycles kfree -> 349 cycles
10000 times kmalloc(128) -> 420 cycles kfree -> 344 cycles
10000 times kmalloc(256) -> 588 cycles kfree -> 521 cycles
10000 times kmalloc(512) -> 646 cycles kfree -> 558 cycles
10000 times kmalloc(1024) -> 668 cycles kfree -> 581 cycles
10000 times kmalloc(2048) -> 690 cycles kfree -> 614 cycles
10000 times kmalloc(4096) -> 861 cycles kfree -> 738 cycles
10000 times kmalloc(8192) -> 1214 cycles kfree -> 1057 cycles
10000 times kmalloc(16384) -> 1503 cycles kfree -> 1380 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 371 cycles
10000 times kmalloc(16)/kfree -> 368 cycles
10000 times kmalloc(32)/kfree -> 366 cycles
10000 times kmalloc(64)/kfree -> 384 cycles
10000 times kmalloc(128)/kfree -> 364 cycles
10000 times kmalloc(256)/kfree -> 370 cycles
10000 times kmalloc(512)/kfree -> 372 cycles
10000 times kmalloc(1024)/kfree -> 377 cycles
10000 times kmalloc(2048)/kfree -> 375 cycles
10000 times kmalloc(4096)/kfree -> 377 cycles
10000 times kmalloc(8192)/kfree -> 379 cycles
10000 times kmalloc(16384)/kfree -> 1592 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=800/526 1=799/522 2=1252/511 3=1251/511 Average=1025/517
Kmalloc N*alloc N*free(16): 0=1055/498 1=1059/499 2=1822/517 3=1738/515 Average=1419/507
Kmalloc N*alloc N*free(32): 0=1207/563 1=1211/561 2=1402/543 3=1398/541 Average=1305/552
Kmalloc N*alloc N*free(64): 0=1275/757 1=1275/898 2=1515/904 3=1508/886 Average=1393/861
Kmalloc N*alloc N*free(128): 0=1295/1488 1=1294/1519 2=1272/1505 3=1244/1525 Average=1276/1509
Kmalloc N*alloc N*free(256): 0=1621/2524 1=1629/2547 2=1645/2540 3=1651/2518 Average=1637/2532
Kmalloc N*alloc N*free(512): 0=1883/2889 1=1892/2864 2=1898/2807 3=1898/2751 Average=1893/2828
Kmalloc N*alloc N*free(1024): 0=2393/3323 1=2400/3326 2=2402/3221 3=2311/3294 Average=2376/3291
Kmalloc N*alloc N*free(2048): 0=2642/3602 1=2653/3582 2=2635/3295 3=2629/3281 Average=2640/3440
Kmalloc N*alloc N*free(4096): 0=3383/4061 1=3383/4060 2=3312/3823 3=3306/3817 Average=3346/3940
Kmalloc N*(alloc free)(8): 0=674 1=666 2=648 3=647 Average=659
Kmalloc N*(alloc free)(16): 0=603 1=604 2=549 3=549 Average=576
Kmalloc N*(alloc free)(32): 0=565 1=566 2=550 3=550 Average=558
Kmalloc N*(alloc free)(64): 0=444 1=444 2=558 3=556 Average=501
Kmalloc N*(alloc free)(128): 0=448 1=449 2=556 3=556 Average=502
Kmalloc N*(alloc free)(256): 0=458 1=458 2=557 3=558 Average=508
Kmalloc N*(alloc free)(512): 0=460 1=461 2=591 3=591 Average=525
Kmalloc N*(alloc free)(1024): 0=461 1=460 2=590 3=636 Average=537
Kmalloc N*(alloc free)(2048): 0=629 1=629 2=671 3=671 Average=650
Kmalloc N*(alloc free)(4096): 0=574 1=574 2=642 3=642 Average=608

But "Kernel C" (with the irqless patch) shows a major improvement in the 
single threaded tests:

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 115 cycles kfree -> 285 cycles
10000 times kmalloc(16) -> 103 cycles kfree -> 283 cycles
10000 times kmalloc(32) -> 127 cycles kfree -> 301 cycles
10000 times kmalloc(64) -> 170 cycles kfree -> 330 cycles
10000 times kmalloc(128) -> 318 cycles kfree -> 389 cycles
10000 times kmalloc(256) -> 474 cycles kfree -> 532 cycles
10000 times kmalloc(512) -> 525 cycles kfree -> 627 cycles
10000 times kmalloc(1024) -> 568 cycles kfree -> 576 cycles
10000 times kmalloc(2048) -> 579 cycles kfree -> 593 cycles
10000 times kmalloc(4096) -> 772 cycles kfree -> 722 cycles
10000 times kmalloc(8192) -> 1148 cycles kfree -> 1019 cycles
10000 times kmalloc(16384) -> 1476 cycles kfree -> 1393 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 132 cycles
10000 times kmalloc(16)/kfree -> 125 cycles
10000 times kmalloc(32)/kfree -> 128 cycles
10000 times kmalloc(64)/kfree -> 125 cycles
10000 times kmalloc(128)/kfree -> 129 cycles
10000 times kmalloc(256)/kfree -> 136 cycles
10000 times kmalloc(512)/kfree -> 140 cycles
10000 times kmalloc(1024)/kfree -> 136 cycles
10000 times kmalloc(2048)/kfree -> 151 cycles
10000 times kmalloc(4096)/kfree -> 136 cycles
10000 times kmalloc(8192)/kfree -> 151 cycles
10000 times kmalloc(16384)/kfree -> 1584 cycles
Concurrent allocs
=================
Kmalloc N*alloc N*free(8): 0=3083/439 1=3081/444 2=3619/455 3=3614/459 Average=3349/449
Kmalloc N*alloc N*free(16): 0=382/504 1=382/501 2=1899/446 3=1896/445 Average=1140/474
Kmalloc N*alloc N*free(32): 0=270/479 1=204/477 2=1012/483 3=1007/488 Average=623/482
Kmalloc N*alloc N*free(64): 0=1287/922 1=1286/913 2=547/911 3=505/903 Average=906/912
Kmalloc N*alloc N*free(128): 0=1221/1496 1=1211/1485 2=1240/1509 3=1232/1508 Average=1226/1499
Kmalloc N*alloc N*free(256): 0=1644/2561 1=1645/2536 2=1802/2560 3=1808/2550 Average=1725/2552
Kmalloc N*alloc N*free(512): 0=1993/2796 1=1999/2806 2=1971/2768 3=1964/2753 Average=1982/2781
Kmalloc N*alloc N*free(1024): 0=1931/3203 1=1929/3216 2=1935/3124 3=1933/3095 Average=1932/3159
Kmalloc N*alloc N*free(2048): 0=2465/3497 1=2461/3485 2=2459/3268 3=2464/3512 Average=2462/3441
Kmalloc N*alloc N*free(4096): 0=3340/3775 1=3336/3771 2=3318/3819 3=3312/4003 Average=3326/3842
Kmalloc N*(alloc free)(8): 0=4199 1=4198 2=4271 3=4270 Average=4234
Kmalloc N*(alloc free)(16): 0=224 1=225 2=565 3=566 Average=395
Kmalloc N*(alloc free)(32): 0=605 1=604 2=584 3=584 Average=594
Kmalloc N*(alloc free)(64): 0=566 1=564 2=573 3=574 Average=569
Kmalloc N*(alloc free)(128): 0=571 1=571 2=571 3=570 Average=571
Kmalloc N*(alloc free)(256): 0=687 1=686 2=747 3=746 Average=716
Kmalloc N*(alloc free)(512): 0=5089 1=5086 2=3832 3=3836 Average=4461
Kmalloc N*(alloc free)(1024): 0=4970 1=4971 2=5082 3=5086 Average=5027
Kmalloc N*(alloc free)(2048): 0=4932 1=4930 2=4237 3=4257 Average=4589
Kmalloc N*(alloc free)(4096): 0=698 1=697 2=811 3=812 Average=755
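
The concurrent "N*(alloc free)" averages for Kernel B and Kernel C can be
compared per size with a short script. This is only a sketch: the numbers
are copied from the "Average=" fields above, the function name is
hypothetical, and the 2x threshold is an arbitrary choice for
illustration. It says nothing about whether a difference is real or
run-to-run noise.

```python
# Average cycles for the concurrent N*(alloc free) test, copied from the
# results above, keyed by object size in bytes.
KERNEL_B = {8: 659, 16: 576, 32: 558, 64: 501, 128: 502,
            256: 508, 512: 525, 1024: 537, 2048: 650, 4096: 608}
KERNEL_C = {8: 4234, 16: 395, 32: 594, 64: 569, 128: 571,
            256: 716, 512: 4461, 1024: 5027, 2048: 4589, 4096: 755}

def slower_sizes(before, after, factor=2.0):
    """Return sizes whose cycle count grew by more than `factor`.

    `factor` is an arbitrary illustration threshold, not a statistical test.
    """
    return sorted(size for size in before
                  if size in after and after[size] > factor * before[size])

if __name__ == "__main__":
    print(slower_sizes(KERNEL_B, KERNEL_C))
```

With these inputs the sizes flagged are 8, 512, 1024 and 2048 bytes, so
the single threaded gains of Kernel C do not carry over uniformly to the
concurrent alloc/free test on this machine.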

While most of my machines showed an improvement from Kernel A -> Kernel B 
-> Kernel C, this one had a significant regression from A to B.
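
To put a number on the A to B regression, the single threaded kmalloc
cycle counts for the small sizes can be compared directly. A minimal
sketch, assuming the first block of results above is Kernel A; the table
and function names are hypothetical and the values are copied from test 1
("Repeatedly allocate then free") of each kernel.

```python
# kmalloc cycles from the single threaded "allocate then free" test,
# keyed by object size: (Kernel A, Kernel B, Kernel C).
KMALLOC_CYCLES = {
    8: (241, 323, 115),
    16: (236, 289, 103),
    32: (259, 254, 127),
    64: (350, 398, 170),
}

def pct_change(before, after):
    """Relative change in percent; positive means more cycles (slower)."""
    return 100.0 * (after - before) / before

def regressions(table, col_before, col_after):
    """Map size -> percent slowdown for sizes that got slower."""
    return {size: round(pct_change(row[col_before], row[col_after]), 1)
            for size, row in table.items()
            if row[col_after] > row[col_before]}

if __name__ == "__main__":
    print("A -> B:", regressions(KMALLOC_CYCLES, 0, 1))
    print("A -> C:", regressions(KMALLOC_CYCLES, 0, 2))
```

For A to B this reports slowdowns at 8, 16 and 64 bytes (roughly 34%,
22% and 14%), and none for A to C.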

"Kernel C" hangs on my netserver machine during netperf -t TCP_RR -l 60, 
though, so hopefully I'll be able to obtain results for that benchmark 
with the irqless patch and see if there's any noticeable improvement once 
it's debugged.


* Re: this_cpu_xx's patchset effect on SLUB cycle counts
  2009-10-14  3:12 ` David Rientjes
@ 2009-10-14 15:31   ` Christoph Lameter
  0 siblings, 0 replies; 3+ messages in thread
From: Christoph Lameter @ 2009-10-14 15:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, linux-kernel, Pekka Enberg, Tejun Heo, Mathieu Desnoyers

On Tue, 13 Oct 2009, David Rientjes wrote:

> I benchmarked this patchset both with and without the irqless patch from
> http://marc.info/?l=linux-kernel&m=125503037213262 on several of my
> machines.  The results were good for the most part, but I found a very
> reproducible regression on my 4-core 8G Xeon 5500 with HyperThreading for
> objects of smaller size (8, 16, and 64 bytes) without the irqless patch:

Hmmm... Strange. Maybe different icache cacheline code placement? There is
no change in data structures without the irqless patch.

Can you change some kernel config options that impact memory and code
layout and rerun? Just to make sure that this is not a freak thing due to
code placement. Are you sure that the kernel tested had the patches
applied?

> But "Kernel C" (with the irqless patch) shows a major improvement in the
> single threaded tests:

C changes the per-cpu layout a bit and also changes the code.

> 2. Kmalloc: alloc/free test
> 10000 times kmalloc(8)/kfree -> 132 cycles

Was the kernel compiled with preemption on? I get cycle numbers with two
digits on these tests using quad nehalems.

> "Kernel C" hangs on my netserver machine during netperf -t TCP_RR -l 60,
> though, so hopefully I'll be able to obtain results for that benchmark
> with the irqless patch and see if there's any noticeable improvement once
> it's debugged.

irqless is a risky patch. There may still be issues there. Thanks for
testing it.


