* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-28 17:35 Van Maren, Kevin
2001-08-28 18:52 ` Linus Torvalds
0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-28 17:35 UTC (permalink / raw)
To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'
Okay, here are some more results with Morton's "2nd" patch,
running 2.4.8, IA64-010815 on 4X IA64 (Lion) with 8GB RAM,
733MHz B3 (4MB) CPUs. 3 Adaptec 39160 channels, each with
8 (18GB) Cheetah drives, create 4 md's using 2 disks from
each channel. Also using Gibbs' 6.2.0 Adaptec driver.
I tried to extract the "important" info to keep this message
relatively small. Meant to get this out earlier...
It looks like "getblk" could use some additional optimization.
More info to follow, but even doing one "mkfs" at a time,
the lru_list_lock is held over 82% of the time, mostly by
getblk().
SMP test case: 4 parallel mkfs (each on 6-disk md):
===================================================
"start"
mkfs /dev/md0 &
mkfs /dev/md1 &
mkfs /dev/md2 &
mkfs /dev/md3
"stop"
Abbreviated/stripped kernprof/gprof output:
------------------------------------------
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
39.46 224.65 224.65 cg_record_arc
16.40 318.00 93.34 6722992 0.01 0.02 getblk
9.02 369.33 51.33 50673121 0.00 0.00 spin_lock_
6.67 407.27 37.95 6722669 0.01 0.01 _make_request
4.51 432.97 25.70 13445261 0.00 0.00 blk_get_queue
2.61 447.83 14.86 long_copy_user
2.59 462.56 14.72 mcount
2.06 474.27 11.71 cpu_idle
1.35 481.97 7.70 26887936 0.00 0.00 _insert_into_lru_list
1.23 489.00 7.02 12463048 0.00 0.00 write_lock_
1.08 495.15 6.15 12463047 0.00 0.00 write_unlock_
0.94 500.52 5.37 7 767.72 2342.72 _invalidate_buffers
0.89 505.56 5.04 72518 0.07 0.08 scsi_dispatch_cmd
0.83 510.26 4.70 216716 0.02 0.65 block_write
0.77 514.63 4.37 20169809 0.00 0.00 _refile_buffer
0.74 518.82 4.19 68098 0.06 0.08 scsi_init_io_vc
0.73 522.96 4.15 58177 0.07 0.34 _scsi_end_request
0.64 526.60 3.64 25304057 0.00 0.00 _remove_from_lru_list
0.59 529.97 3.36 718 4.68 4.68 si_swapinfo
0.56 533.17 3.21 5644224 0.00 0.00 end_buffer_io_sync
0.40 535.46 2.29 43581 0.05 0.06 ahc_linux_isr
0.36 537.50 2.04 210123 0.01 0.45 write_some_buffers
0.35 539.52 2.02 6729267 0.00 0.00 _remove_from_free_list
0.34 541.46 1.94 13445412 0.00 0.00 balance_dirty_state
[misc "interesting" others]
0.09 557.35 0.49 116 4.22 41.74 wait_for_buffers
0.06 560.37 0.33 58177 0.01 0.35 scsi_finish_command
0.02 565.67 0.14 54 2.57 1655.62 sync_old_buffers
0.01 567.60 0.04 42 1.00 18.82 page_launder
0.00 569.24 0.00 43 0.05 146.48 write_unlocked_buffers
0.00 569.29 0.00 42 0.00 19.44 do_try_to_free_pages
0.00 569.29 0.00 29 0.00 306.44 sync_buffers
0.00 569.29 0.00 28 0.00 92.44 wait_for_locked_buffers
0.00 569.29 0.00 15 0.00 306.48 fsync_dev
0.00 569.29 0.00 8 0.00 574.65 sys_fsync
0.00 569.29 0.00 7 0.00 2955.64 blkdev_put
index % time self children called name
[1] 40.5 224.65 0.00 cg_record_arc [1]
[2] 32.3 0.02 179.06 ia64_ret_from_syscall [2]
0.10 150.24 232733/232733 sys_write [3]
0.10 150.24 232733/232733 ia64_ret_from_syscall [2]
[3] 27.1 0.10 150.24 232733 sys_write [3]
4.70 136.37 216716/216716 block_write [4]
4.70 136.37 216716/216716 sys_write [3]
[4] 25.4 4.70 136.37 216716 block_write [4]
93.34 26.66 6722748/6722992 getblk [5]
1.16 11.57 6722748/6723007 mark_buffer_dirty [30]
0.32 2.90 13445336/13445412 balance_dirty [54]
0.40 0.00 6722748/6736794 _brelse [112]
0.00 0.00 32/6722992 block_read [391]
0.00 0.00 212/6722992 bread [298]
93.34 26.66 6722748/6722992 block_write [4]
[5] 21.6 93.34 26.66 6722992 getblk [5]
13.67 0.00 13492508/50673121 spin_lock_ [12]
3.80 0.00 6747334/12463048 write_lock_ [40]
3.33 0.00 6747334/12463047 write_unlock_ [42]
2.02 0.00 6720832/6729267 _remove_from_free_list [69]
1.93 0.00 6720832/26887936 _insert_into_lru_list [37]
0.00 0.76 24342/24342 refill_freelist [90]
0.74 0.00 6720832/6720832 _insert_into_queues [92]
0.26 0.00 13492508/51086468 spin_unlock_ [78]
0.16 0.00 6720832/6720832 init_buffer [144]
0.13 6.15 13845/210123 write_unlocked_buffers [41]
1.91 87.15 196278/210123 sync_old_buffers [8]
[6] 17.2 2.04 93.30 210123 write_some_buffers [6]
0.13 88.77 210087/210087 write_locked_buffers [9]
1.46 0.97 6722667/20169809 _refile_buffer [39]
1.93 0.00 6722665/26887936 _insert_into_lru_list [37]
[7] 16.5 0.00 91.66 kupdate [7]
0.14 89.26 54/54 sync_old_buffers [8]
0.23 2.03 54/116 wait_for_buffers [46]
0.14 89.26 54/54 kupdate [7]
[8] 16.1 0.14 89.26 54 sync_old_buffers [8]
1.91 87.15 196278/210123 write_some_buffers [6]
0.20 0.00 196385/50673121 spin_lock_ [12]
0.13 88.77 210087/210087 write_some_buffers [6]
[9] 16.0 0.13 88.77 210087 write_locked_buffers [9]
1.10 87.67 6722665/6722669 submit_bh [10]
0.00 0.00 4/6722669 ll_rw_block [478]
1.10 87.67 6722665/6722669 write_locked_buffers [9]
[10] 16.0 1.10 87.67 6722669 submit_bh [10]
0.97 86.70 6722669/6722669 generic_make_request [11]
0.97 86.70 6722669/6722669 submit_bh [10]
[11] 15.8 0.97 86.70 6722669 generic_make_request [11]
37.95 6.98 6722669/6722669 _make_request [13]
25.70 13.88 13445261/13445261 blk_get_queue [14]
0.62 1.58 6722592/6722592 md_make_request [67]
0.11 0.00 108665/50673121 _wake_up [113]
0.16 0.00 156642/50673121 real_lookup [71]
0.16 0.00 156651/50673121 d_rehash [143]
0.16 0.00 156653/50673121 d_alloc [138]
0.16 0.00 156656/50673121 d_instantiate [142]
0.16 0.00 156661/50673121 get_empty_inode [136]
0.20 0.00 196385/50673121 sync_old_buffers [8]
0.22 0.00 218577/50673121 sys_lseek [38]
0.37 0.00 360440/50673121 timer_bh [87]
0.39 0.00 389480/50673121 get_unused_buffer_head [103]
0.41 0.00 405522/50673121 d_lookup [105]
0.81 0.00 800334/50673121 atomic_dec_and_lock [83]
4.81 0.00 4749700/50673121 put_last_free [45]
6.81 0.00 6722665/50673121 refile_buffer [34]
6.81 0.00 6722816/50673121 _make_request [13]
13.62 0.00 13445261/50673121 blk_get_queue [14]
13.67 0.00 13492508/50673121 getblk [5]
[12] 9.3 51.33 0.00 50673121 spin_lock_ [12]
37.95 6.98 6722669/6722669 generic_make_request [11]
[13] 8.1 37.95 6.98 6722669 _make_request [13]
6.81 0.00 6722816/50673121 spin_lock_ [12]
0.13 0.00 6722816/51086468 spin_unlock_ [78]
0.00 0.03 147/147 _get_request_wait [187]
25.70 13.88 13445261/13445261 generic_make_request [11]
[14] 7.1 25.70 13.88 13445261 blk_get_queue [14]
13.62 0.00 13445261/50673121 spin_lock_ [12]
0.26 0.00 13445261/51086468 spin_unlock_ [78]
[15] 4.5 0.05 24.67 ia64_leave_kernel [15]
0.00 24.28 1479060/1479060 ia64_handle_irq [17]
0.31 0.05 26208/30536 schedule [111]
0.00 0.03 3629/3629 ia64_do_page_fault [186]
0.00 24.28 1492630/1492630 ia64_handle_irq [17]
[16] 4.4 0.00 24.28 1492630 do_IRQ [16]
0.13 20.78 403703/444995 do_softirq [18]
0.39 2.89 1492630/1492630 handle_IRQ_event [53]
Lockmeter output (stripped -- holds lock for 1ms+, or 3%+):
-----------------------------------------------------------
System: Linux vanmaren.slc.unisys.com 2.4.8 #23 SMP Fri Aug 24 22:10:04 UTC 2001 ia64
Total counts
All (4) CPUs
Selecting locks that meet ALL of the following:
utilization : > 1.00%
Start time: Fri Aug 24 22:23:05 2001
End time: Fri Aug 24 22:28:43 2001
Delta Time: 337.74 sec.
Hash table slots in use: 388.
Global read lock slots in use: 245.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
25.1% 6.1us(5450ms) 59us(5817ms)(65.4%) 60013220 74.9% 25.1% 0.00% *TOTAL*
3.1% 0.06% 25us( 14ms) 0us 414766 100% 0% 0.06% global_bh_lock
3.1% 0.06% 25us( 14ms) 0us 414766 100% 0% 0.06% bh_action+0x60
6.6% 3.9% 1.1us( 201us) 11us( 223us)(0.63%) 20495246 96.1% 3.9% 0% io_request_lock
3.1% 0.44% 28us(5450ms) 16ms(5817ms)( 2.0%) 381041 99.6% 0.44% 0% kernel_flag
2.7% 57.1% 1280ms(5450ms) 96us( 183us)(0.00%) 7 42.9% 57.1% 0% blkdev_put+0xf0
0.26% 0.19% 6.5us(2393us) 33ms(5450ms)(0.65%) 136598 99.8% 0.19% 0% real_lookup+0x220
0.12% 0.56% 27us(1026us) 28us( 709us)(0.00%) 15118 99.4% 0.56% 0% tty_write+0x4c0
0.03% 0.38% 29us(5251us) 485ms(5817ms)(0.43%) 3153 99.6% 0.38% 0% vfs_readdir+0x1e0
87.8% 95.5% 21us(5450ms) 62us(5450ms)(62.1%) 14241646 4.5% 95.5% 0% lru_list_lock
2.7% 14.3% 1280ms(5450ms) 2.6us( 2.6us)(0.00%) 7 85.7% 14.3% 0% __invalidate_buffers+0x60
0.03% 100% 4320us( 43ms) 45us( 102us)(0.00%) 20 0% 100% 0% bdflush+0x1d0
78.1% 96.9% 37us( 12ms) 62us( 256ms)(31.6%) 7143220 3.1% 96.9% 0% getblk+0x60
1.3% 63.0% 80ms( 261ms) 65us( 436us)(0.00%) 54 37.0% 63.0% 0% kupdate+0x1b0
4.1% 94.5% 2.0us(2945us) 61us(5450ms)(28.6%) 6725777 5.5% 94.5% 0% refile_buffer+0x50
1.0% 95.9% 18us(6871us) 66us( 12ms)(0.92%) 195148 4.1% 95.9% 0% sync_old_buffers+0x120
0.47% 85.3% 9.8us(1407us) 81us( 26ms)(0.83%) 162013 14.7% 85.3% 0% try_to_free_buffers+0xb0
0.03% 8.7% 1102us( 55ms) 52ms( 199ms)(0.03%) 92 91.3% 8.7% 0% wait_for_locked_buffers+0x60
0.19% 27.6% 42us( 20ms) 324us( 253ms)(0.10%) 14915 72.4% 27.6% 0% write_unlocked_buffers+0x60
2.1% 9.1% 1.0us(3507us) 11us( 10ms)(0.51%) 6745773 90.9% 9.1% 0% unused_list_lock
2.0% 9.1% 1.0us(3507us) 11us( 10ms)(0.50%) 6720952 90.9% 9.1% 0% get_unused_buffer_head+0x50
1.0% 0.29% 0.5us(1386us) 2.0us( 712us)(0.00%) 7140899 99.7% 0.29% 0% getblk+0x2d0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS HOLD MAX RDR BUSY PERIOD WAIT
UTIL CON MEAN RDRS MEAN( MAX ) MEAN( MAX )( %CPU) TOTAL NOWAIT SPIN NAME
0.00% 20us( 45us)(0.00%) 847741 100% 0.00% *TOTAL*
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES HOLD WAIT (ALL) WAIT (WW)
UTIL CON MEAN( MAX ) MEAN( MAX )( %CPU) MEAN( MAX ) TOTAL NOWAIT SPIN( WW ) NAME
0.00% 18us( 12ms) 6.1us( 52us)(0.00%) 1.9us( 2.3us) 12941611 100% 0.00%(0.00%) *TOTAL*
69.2% 0.00% 19us( 12ms) 11us( 52us)(0.00%) 1.9us( 2.3us) 12058253 100% 0.00%(0.00%) hash_table_lock
1.8% 0% 1.3us( 49us) 0us 0us 4753020 100% 0%( 0%) __invalidate_buffers+0x270
67.2% 0.00% 32us( 12ms) 11us( 52us)(0.00%) 1.9us( 2.3us) 7143220 100% 0.00%(0.00%) getblk+0x80
0.24% 0% 4.9us(1402us) 0us 0us 162013 100% 0%( 0%) try_to_free_buffers+0xd0
=========================================================
Single CPU/FS test case: 1 mkfs (md of the 4 md's above):
Note that only 16% is "idle" on a 4X.
=========================================================
Kgprof:
-------
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
28.88 158.93 158.93 cg_record_arc
15.89 246.38 87.44 cpu_idle
14.82 327.95 81.58 6723015 0.01 0.02 getblk
14.42 407.30 79.35 59278971 0.00 0.00 spin_lock_
5.23 436.10 28.80 20168512 0.00 0.00 blk_get_queue
4.60 461.44 25.34 6722884 0.00 0.01 _make_request
2.56 475.53 14.08 mcount
2.21 487.71 12.18 long_copy_user
0.91 492.72 5.01 87690 0.06 0.06 scsi_dispatch_cmd
0.78 496.99 4.27 83855 0.05 0.27 _scsi_end_request
0.75 501.11 4.12 87690 0.05 0.08 scsi_init_io_vc
0.68 504.85 3.74 2 1870.61 7179.64 _invalidate_buffers
0.65 508.43 3.57 6398252 0.00 0.00 end_buffer_io_sync
0.51 511.25 2.82 56707 0.05 0.06 ahc_linux_isr
0.49 513.97 2.72 666 4.09 4.09 si_swapinfo
0.44 516.40 2.43 82692 0.03 0.03 scsi_malloc
0.32 518.17 1.78 26887713 0.00 0.00 _insert_into_lru_list
[other interesting points]
0.24 522.20 1.33 216677 0.01 0.55 block_write
0.21 527.01 1.16 87102 0.01 0.16 scsi_request_fn
0.10 537.63 0.54 47 11.55 31.51 page_launder
0.04 544.66 0.24 98 2.43 15.73 wait_for_buffers
0.01 548.49 0.05 47 0.98 1887.26 sync_old_buffers
0.00 550.42 0.00 47 0.00 32.21 do_try_to_free_pages
0.00 550.42 0.00 12 0.00 462.33 write_unlocked_buffers
0.00 550.42 0.00 8 0.00 793.75 sync_buffers
0.00 550.42 0.00 8 0.00 100.26 wait_for_locked_buffers
0.00 550.42 0.00 4 0.00 793.75 fsync_dev
0.00 550.42 0.00 2 0.00 8767.16 blkdev_put
0.00 550.42 0.00 2 0.00 1587.54 sys_fsync
[1] 29.6 158.93 0.00 cg_record_arc [1]
[2] 28.3 0.03 152.01 ia64_ret_from_syscall [2]
0.09 127.90 231867/231867 sys_write [3]
0.03 7.21 66165/66165 sys_read [37]
0.07 6.84 218393/218393 sys_lseek [38]
0.09 127.90 231867/231867 ia64_ret_from_syscall [2]
[3] 23.9 0.09 127.90 231867 sys_write [3]
1.33 118.33 216677/216677 block_write [4]
1.33 118.33 216677/216677 sys_write [3]
[4] 22.3 1.33 118.33 216677 block_write [4]
81.58 22.81 6722853/6723015 getblk [5]
0.64 10.42 6722853/6723043 mark_buffer_dirty [31]
0.00 0.00 8/6723015 block_read [414]
0.00 0.00 154/6723015 bread [320]
81.58 22.81 6722853/6723015 block_write [4]
[5] 19.5 81.58 22.81 6723015 getblk [5]
18.08 0.00 13509844/59278971 spin_lock_ [13]
0.00 1.17 32955/32955 refill_freelist [76]
0.87 0.00 6720919/6720919 _insert_into_queues [84]
0.76 0.00 6720919/6720919 _remove_from_free_list [89]
0.58 0.00 6755970/14043341 write_unlock_ [75]
0.47 0.00 6755970/14043341 write_lock_ [83]
0.44 0.00 6720919/26887713 _insert_into_lru_list [64]
0.25 0.00 13509844/59693888 spin_unlock_ [78]
0.18 0.00 6720919/6720919 init_buffer [142]
0.01 5.52 12374/210102 write_unlocked_buffers [42]
0.20 88.19 197728/210102 sync_old_buffers [11]
[6] 17.5 0.21 93.71 210102 write_some_buffers [6]
0.04 92.58 210092/210092 write_locked_buffers [7]
0.36 0.28 6722899/22499809 _refile_buffer [59]
0.44 0.00 6722883/26887713 _insert_into_lru_list [64]
0.04 92.58 210092/210092 write_some_buffers [6]
[7] 17.3 0.04 92.58 210092 write_locked_buffers [7]
0.30 92.28 6722883/6722884 submit_bh [8]
0.00 0.00 1/6722884 ll_rw_block [507]
0.30 92.28 6722883/6722884 write_locked_buffers [7]
[8] 17.3 0.30 92.28 6722884 submit_bh [8]
0.25 92.02 6722884/6722884 generic_make_request [9]
0.25 92.02 6722884/6722884 submit_bh [8]
[9] 17.2 0.25 92.02 6722884 generic_make_request [9]
28.80 27.38 20168512/20168512 blk_get_queue [14]
25.34 9.16 6722884/6722884 _make_request [15]
0.33 1.01 13445628/13445628 md_make_request [70]
[10] 16.7 0.00 89.44 kupdate [10]
0.05 88.66 47/47 sync_old_buffers [11]
0.11 0.62 47/98 wait_for_buffers [65]
0.05 88.66 47/47 kupdate [10]
[11] 16.5 0.05 88.66 47 sync_old_buffers [11]
0.20 88.19 197728/210102 write_some_buffers [6]
0.26 0.00 197821/59278971 spin_lock_ [13]
0.00 0.00 47/47 sync_unlocked_inodes [295]
0.00 0.00 47/51 sync_supers [424]
0.00 0.00 187/59693888 spin_unlock_ [78]
[12] 16.3 87.44 0.09 cpu_idle [12]
0.06 0.03 10282/31233 schedule [129]
0.15 0.00 109352/59278971 ahc_linux_isr [50]
0.15 0.00 113900/59278971 do_IRQ [33]
0.18 0.00 136509/59278971 real_lookup [63]
0.18 0.00 136518/59278971 d_alloc [132]
0.18 0.00 136518/59278971 d_rehash [141]
0.18 0.00 136521/59278971 d_instantiate [140]
0.18 0.00 136521/59278971 get_empty_inode [133]
0.19 0.00 139538/59278971 _wake_up [107]
0.23 0.00 173973/59278971 try_to_free_buffers [91]
0.26 0.00 197821/59278971 sync_old_buffers [11]
0.29 0.00 218393/59278971 sys_lseek [38]
0.47 0.00 353459/59278971 d_lookup [101]
0.47 0.00 354202/59278971 timer_bh [99]
0.71 0.00 527288/59278971 get_unused_buffer_head [87]
0.93 0.00 696032/59278971 atomic_dec_and_lock [82]
8.45 0.00 6310839/59278971 put_last_free [36]
9.00 0.00 6722880/59278971 refile_buffer [35]
9.00 0.00 6723070/59278971 _make_request [15]
18.08 0.00 13509844/59278971 getblk [5]
27.00 0.00 20168512/59278971 blk_get_queue [14]
[13] 14.8 79.35 0.00 59278971 spin_lock_ [13]
28.80 27.38 20168512/20168512 generic_make_request [9]
[14] 10.5 28.80 27.38 20168512 blk_get_queue [14]
27.00 0.00 20168512/59278971 spin_lock_ [13]
0.38 0.00 20168512/59693888 spin_unlock_ [78]
Lockstat:
---------
System: Linux vanmaren.slc.unisys.com 2.4.8 #23 SMP Fri Aug 24 22:10:04 UTC 2001 ia64
Total counts
All (4) CPUs
Selecting locks that meet ALL of the following:
utilization : > 1.00%
Start time: Fri Aug 24 22:43:51 2001
End time: Fri Aug 24 22:48:34 2001
Delta Time: 283.01 sec.
Hash table slots in use: 297.
Global read lock slots in use: 214.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
1.1% 4.6us( 582ms) 47us(5781ms)( 2.7%) 59050695 98.9% 1.1% 0.02% *TOTAL*
1.0% 5.1% 18us( 111us) 35us( 576us)(0.03%) 163560 94.9% 5.1% 0% allocator_request_lock
1.4% 0.01% 0.3us( 7.2us) 15us( 41us)(0.00%) 13108126 100% 0.01% 0% free_list+0x18
3.8% 2.3% 27us(7193us) 0us 400694 97.7% 0% 2.3% global_bh_lock
3.8% 2.3% 27us(7193us) 0us 400694 97.7% 0% 2.3% bh_action+0x60
5.5% 0.78% 0.6us( 96us) 3.4us( 95us)(0.06%) 27316523 99.2% 0.78% 0% io_request_lock
3.6% 0.27% 1.5us( 42us) 3.9us( 45us)(0.01%) 6723632 99.7% 0.27% 0% __make_request+0x290
82.4% 3.1% 17us( 582ms) 32us( 582ms)( 1.2%) 13719319 96.9% 3.1% 0% lru_list_lock
0.04% 0% 59ms( 117ms) 0us 2 100% 0% 0% __invalidate_buffers+0x60
78.3% 0.26% 33us( 97us) 89us( 296ms)(0.14%) 6746687 99.7% 0.26% 0% getblk+0x60
1.5% 18.2% 99ms( 582ms) 5.4us( 29us)(0.00%) 44 81.8% 18.2% 0% kupdate+0x1b0
0.01% 0% 603us( 15ms) 0us 48 100% 0% 0% wait_for_locked_buffers+0x60
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS HOLD MAX RDR BUSY PERIOD WAIT
UTIL CON MEAN RDRS MEAN( MAX ) MEAN( MAX )( %CPU) TOTAL NOWAIT SPIN NAME
0.00% 1.2us( 1.2us)(0.00%) 766789 100% 0.00% *TOTAL*
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES HOLD WAIT (ALL) WAIT (WW)
UTIL CON MEAN( MAX ) MEAN( MAX )( %CPU) MEAN( MAX ) TOTAL NOWAIT SPIN( WW ) NAME
0% 16us( 97us) 0us 0us 13859543 100% 0%( 0%) *TOTAL*
79.8% 0% 17us( 97us) 0us 0us 13125889 100% 0%( 0%) hash_table_lock
2.6% 0% 1.2us( 5.4us) 0us 0us 6339799 100% 0%( 0%) __invalidate_buffers+0x270
76.9% 0% 32us( 97us) 0us 0us 6746687 100% 0%( 0%) getblk+0x80
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-28 17:35 The cause of the "VM" performance problem with 2.4.X Van Maren, Kevin
@ 2001-08-28 18:52 ` Linus Torvalds
2001-08-28 19:29 ` André Dahlqvist
2001-08-29 8:22 ` Jens Axboe
0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2001-08-28 18:52 UTC (permalink / raw)
To: kevin.vanmaren, linux-kernel
In article <245F259ABD41D511A07000D0B71C4CBA289F5F@us-slc-exch-3.slc.unisys.com> you
write:
>
>Okay, here are some more results with Morton's "2nd" patch,
>running 2.4.8, IA64-010815 on 4X IA64 (Lion) with 8GB RAM,
>733MHz B3 (4MB) CPUs. 3 Adaptec 39160 channels, each with
>8 (18GB) Cheetah drives, create 4 md's using 2 disks from
>each channel. Also using Gibbs' 6.2.0 adaptec driver.
>I tried to extract the "important" info to keep this message
>relatively small. Meant to get this out earlier...
Interesting.
>It looks like "getblk" could use some additional optimization.
>More info to follow, but even doing one "mkfs" at a time,
>the lru_list_lock is held over 82% of the time, mostly by
>getblk().
Note that the real fix is to make the block devices use the page cache.
Andrea has patches, but it's a 2.5.x thing.
The buffer cache is really meant to be used only for meta-data, and
"getblk()" should not show up on profiles under any reasonable load. I
seriously doubt that running parallel (or even a single) "mkfs" is a
reasonable load under normal use.
So I don't consider the buffer cache locking to be a critical
performance issue - but I _do_ want to make sure that we don't have any
_really_ horrid special cases where we just perform so abysmally that
it's not worth using Linux at all.
So small inefficiencies and a lack of parallelism are not a huge issue
for the buffer cache - at least not until somebody finds a more
"interesting" load per se.
As to the really bad behaviour with the CPU spending all the time just
traversing the dirty buffer lists, I'd love to hear what 2.4.10-pre2
does. It doesn't use Andrew's advanced patch, but instead just makes
dirty buffer balancing not care about the device - which should allow
much better IO behaviour and not have any silly bad cases from a VM
standpoint.
>Abbreviated/stripped kernprof/gprof output:
>------------------------------------------
>
>Each sample counts as 0.000976562 seconds.
> % cumulative self self total
> time seconds seconds calls ms/call ms/call name
> 39.46 224.65 224.65 cg_record_arc
> 16.40 318.00 93.34 6722992 0.01 0.02 getblk
> 9.02 369.33 51.33 50673121 0.00 0.00 spin_lock_
> 6.67 407.27 37.95 6722669 0.01 0.01 _make_request
> 4.51 432.97 25.70 13445261 0.00 0.00 blk_get_queue
> 2.61 447.83 14.86 long_copy_user
> 2.59 462.56 14.72 mcount
> 2.06 474.27 11.71 cpu_idle
Now, while I don't worry about "getblk()" itself, the request stuff and
blk_get_queue() _can_ be quite an issue even under non-mkfs load, so
that certainly implies that we have some work to do. The area of the
block device request queueing is definitely something that 2.5.x will
start working on (Jens is already doing lots of stuff, obviously).
And your lock profile certainly shows the io_request_lock as a _major_
lock user, although I'm happy to see that contention seems to be
reasonably low. Still, I'd bet that it is worth working on..
Linus
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-28 18:52 ` Linus Torvalds
@ 2001-08-28 19:29 ` André Dahlqvist
2001-08-29 13:49 ` Rik van Riel
2001-08-29 8:22 ` Jens Axboe
1 sibling, 1 reply; 14+ messages in thread
From: André Dahlqvist @ 2001-08-28 19:29 UTC (permalink / raw)
To: linux-kernel
Linus Torvalds <torvalds@transmeta.com> wrote:
> Note that the real fix is to make the block devices use the page cache.
> Andrea has patches, but it's a 2.5.x thing.
Linus, could you please make a statement about when 2.5.x is likely to
start? A lot of people wonder about this, and we all know you have some
sort of plan laid out. Tell us about it :-)
--
André Dahlqvist <andre.dahlqvist@telia.com>
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-28 18:52 ` Linus Torvalds
2001-08-28 19:29 ` André Dahlqvist
@ 2001-08-29 8:22 ` Jens Axboe
2001-08-29 8:25 ` Jens Axboe
1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2001-08-29 8:22 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kevin.vanmaren, linux-kernel
On Tue, Aug 28 2001, Linus Torvalds wrote:
> >Abbreviated/stripped kernprof/gprof output:
> >------------------------------------------
> >
> >Each sample counts as 0.000976562 seconds.
> > % cumulative self self total
> > time seconds seconds calls ms/call ms/call name
> > 39.46 224.65 224.65 cg_record_arc
> > 16.40 318.00 93.34 6722992 0.01 0.02 getblk
> > 9.02 369.33 51.33 50673121 0.00 0.00 spin_lock_
> > 6.67 407.27 37.95 6722669 0.01 0.01 _make_request
> > 4.51 432.97 25.70 13445261 0.00 0.00 blk_get_queue
> > 2.61 447.83 14.86 long_copy_user
> > 2.59 462.56 14.72 mcount
> > 2.06 474.27 11.71 cpu_idle
>
> Now, while I don't worry about "getblk()" itself, the request stuff and
> blk_get_queue() _can_ be quite an issue even under non-mkfs load, so
blk_get_queue() is easy to 'fix', it grabs io_request_lock for no good
reason at all. I think this must have been a failed attempt to protect
switching of queues; however, it's obviously very broken in this regard.
So in fact no skin is off our nose for just removing the io_request_lock
in that path. 2.5 will have it properly reference counted...
> And your lock profile certainly shows the io_request_lock as a _major_
> lock user, although I'm happy to see that contention seems to be
> reasonably low. Still, I'd bet that it is worth working on..
Sure is; the bio patches have not had io_request_lock in them for some
time.
--
Jens Axboe
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-29 8:22 ` Jens Axboe
@ 2001-08-29 8:25 ` Jens Axboe
0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2001-08-29 8:25 UTC (permalink / raw)
To: Linus Torvalds; +Cc: kevin.vanmaren, linux-kernel
[-- Attachment #1: Type: text/plain, Size: 717 bytes --]
On Wed, Aug 29 2001, Jens Axboe wrote:
> > > 4.51 432.97 25.70 13445261 0.00 0.00 blk_get_queue
> >
> > Now, while I don't worry about "getblk()" itself, the request stuff and
> > blk_get_queue() _can_ be quite an issue even under non-mkfs load, so
>
> blk_get_queue() is easy to 'fix', it grabs io_request_lock for no good
> reason at all. I think this must have been a failed attempt to protect
> switching of queues, however it's obviously very broken in this regard.
> So in fact no skin is off our nose for just removing the io_request_lock
> in that path. 2.5 will have it properly reference counted...
Linus, will you take this patch to remove io_request_lock in this path?
--
Jens Axboe
[-- Attachment #2: blk-get-queue-lock-1 --]
[-- Type: text/plain, Size: 1734 bytes --]
--- drivers/block/ll_rw_blk.c~ Wed Aug 29 10:22:50 2001
+++ drivers/block/ll_rw_blk.c Wed Aug 29 10:23:12 2001
@@ -131,7 +131,7 @@
return max_sectors[MAJOR(dev)][MINOR(dev)];
}
-inline request_queue_t *__blk_get_queue(kdev_t dev)
+inline request_queue_t *blk_get_queue(kdev_t dev)
{
struct blk_dev_struct *bdev = blk_dev + MAJOR(dev);
@@ -141,22 +141,6 @@
return &blk_dev[MAJOR(dev)].request_queue;
}
-/*
- * NOTE: the device-specific queue() functions
- * have to be atomic!
- */
-request_queue_t *blk_get_queue(kdev_t dev)
-{
- request_queue_t *ret;
- unsigned long flags;
-
- spin_lock_irqsave(&io_request_lock,flags);
- ret = __blk_get_queue(dev);
- spin_unlock_irqrestore(&io_request_lock,flags);
-
- return ret;
-}
-
static int __blk_cleanup_queue(struct list_head *head)
{
struct request *rq;
@@ -1272,7 +1256,6 @@
EXPORT_SYMBOL(end_that_request_last);
EXPORT_SYMBOL(blk_init_queue);
EXPORT_SYMBOL(blk_get_queue);
-EXPORT_SYMBOL(__blk_get_queue);
EXPORT_SYMBOL(blk_cleanup_queue);
EXPORT_SYMBOL(blk_queue_headactive);
EXPORT_SYMBOL(blk_queue_make_request);
--- include/linux/blkdev.h~ Wed Aug 29 10:23:19 2001
+++ include/linux/blkdev.h Wed Aug 29 10:23:29 2001
@@ -149,8 +149,7 @@
extern void grok_partitions(struct gendisk *dev, int drive, unsigned minors, long size);
extern void register_disk(struct gendisk *dev, kdev_t first, unsigned minors, struct block_device_operations *ops, long size);
extern void generic_make_request(int rw, struct buffer_head * bh);
-extern request_queue_t *blk_get_queue(kdev_t dev);
-extern inline request_queue_t *__blk_get_queue(kdev_t dev);
+extern inline request_queue_t *blk_get_queue(kdev_t dev);
extern void blkdev_release_request(struct request *);
/*
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-28 19:29 ` André Dahlqvist
@ 2001-08-29 13:49 ` Rik van Riel
0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2001-08-29 13:49 UTC (permalink / raw)
To: André Dahlqvist; +Cc: linux-kernel
On Tue, 28 Aug 2001, André Dahlqvist wrote:
> Linus Torvalds <torvalds@transmeta.com> wrote:
>
> > Note that the real fix is to make the block devices use the page cache.
> > Andrea has patches, but it's a 2.5.x thing.
>
> Linus, could you please make a statement when 2.5.x is likely to start?
> A lot of people wonder this and we all know you have some sort of plan
> laid out. Tell us about it:-)
2.4.8-pre* seems to have been the turning point where
experimental stuff started getting introduced ;)
Rik
--
IA64: a worthy successor to i860.
http://www.surriel.com/ http://distro.conectiva.com/
Send all your spam to aardvark@nl.linux.org (spam digging piggy)
* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23 17:26 Van Maren, Kevin
0 siblings, 0 replies; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23 17:26 UTC (permalink / raw)
To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'
I did a "mkfs" on a "md" of 3 8-disk "md"s (24 disk md). So no
lock contention, and only one "dev" to match, right? [Try to
avoid the problem.]
getblk was holding the lru_list_lock 86% of the time.
Was averaging about 25,000 blocks/sec to the disks (vmstat),
but it was often 200,000 over 2 seconds, then 5 seconds idle,
etc. Disk lights didn't come on much (but they all blinked
nicely in sync).
I'll try your patch out and run individual disks again.
Kevin
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-23 17:06 Van Maren, Kevin
@ 2001-08-23 17:18 ` Andrew Morton
0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-23 17:18 UTC (permalink / raw)
To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'
"Van Maren, Kevin" wrote:
>
> It looks like the same patch as yesterday. Did you attach the wrong
> patch?
Drat, sorry - it's on the other machine. Here's a reconstruction:
--- linux-2.4.9/fs/buffer.c Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c Thu Aug 23 10:15:41 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct
* return without it!
*/
#define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
int nr;
next = lru_list[BUF_DIRTY];
+ if (start_bh && *start_bh) {
+ if ((*start_bh)->b_list == BUF_DIRTY)
+ next = *start_bh;
+ brelse(*start_bh);
+ *start_bh = NULL;
+ }
nr = nr_buffers_type[BUF_DIRTY] * 2;
count = 0;
while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
if (dev && bh->b_dev != dev)
continue;
- if (test_and_set_bit(BH_Lock, &bh->b_state))
+ if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+ /* Shouldn't be on BUF_DIRTY */
+ __refile_buffer(bh);
continue;
+ }
if (atomic_set_buffer_clean(bh)) {
__refile_buffer(bh);
get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
if (count < NRSYNC)
continue;
+ if (start_bh && next) {
+ get_bh(next);
+ *start_bh = next;
+ }
spin_unlock(&lru_list_lock);
write_locked_buffers(array, count);
return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
*/
static void write_unlocked_buffers(kdev_t dev)
{
+ struct buffer_head *start_bh = NULL;
do {
spin_lock(&lru_list_lock);
- } while (write_some_buffers(dev));
+ } while (write_some_buffers(dev, &start_bh));
+ brelse(start_bh);
run_task_queue(&tq_disk);
}
@@ -1116,14 +1131,14 @@ int balance_dirty_state(kdev_t dev)
*/
void balance_dirty(kdev_t dev)
{
- int state = balance_dirty_state(dev);
+ int state = balance_dirty_state(NODEV);
if (state < 0)
return;
/* If we're getting into imbalance, start write-out */
spin_lock(&lru_list_lock);
- write_some_buffers(dev);
+ write_some_buffers(NODEV, NULL);
/*
* And if we're _really_ out of balance, wait for
@@ -1132,8 +1147,9 @@ void balance_dirty(kdev_t dev)
* This will throttle heavy writers.
*/
if (state > 0) {
- wait_for_some_buffers(dev);
+ write_some_buffers(dev, NULL);
wakeup_bdflush();
+ wait_for_some_buffers(dev);
}
}
@@ -2607,7 +2623,7 @@ static int sync_old_buffers(void)
bh = lru_list[BUF_DIRTY];
if (!bh || time_before(jiffies, bh->b_flushtime))
break;
- if (write_some_buffers(NODEV))
+ if (write_some_buffers(NODEV, NULL))
continue;
return 0;
}
@@ -2706,7 +2722,7 @@ int bdflush(void *startup)
CHECK_EMERGENCY_SYNC
spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+ if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
wait_for_some_buffers(NODEV);
interruptible_sleep_on(&bdflush_wait);
}
* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23 17:06 Van Maren, Kevin
2001-08-23 17:18 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23 17:06 UTC (permalink / raw)
To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'
> > Yes, it helps quite a bit. Still not as good as I'd like: I don't
> > dare try lots of disks yet :-(
>
> Great, thanks. Aside from the lock contention and stuff, how was the
> actual disk throughput affected? You said that the stock kernel
> basically stops doing anything, yes?
It got really bad. It would make "progress", but slower than I
can encode the bits on the platter with a hand magnet ;-)
Now, it's only "pretty bad".
> > Looks like blkdev_put() holds kernel_flag for way too long.
>
> It calls fsync_dev().
Right. Which ends up calling our friend, write_unlocked_buffers(),
which does locking, but it also holds the kernel lock the entire time.
> There are two n^2 loops in buffer.c. There's one on the
> sync_buffers()
> path, which we fixed yesterday. But there's also a potential
> O(2n) path
> in balance_dirty(). So if we're calling mark_buffer_dirty() a lot,
> this becomes quadratic. For this to bite us the BUF_DIRTY list would
> have to be "almost full" of buffers for another device. There's also
> some weird stuff in sync_buffers() - not sure what it's trying to do.
> I'll take that up with the boss when he gets back from the polar bear
> hunt or whatever it is they do over there.
>
> Here's a different patch which addresses the balance_dirty() thing
> as well..
It looks like the same patch as yesterday. Did you attach the wrong
patch? In any event, it doesn't look like it helps balance_dirty()
at all because the location is not retained across multiple calls
to write_some_buffers().
Kevin Van Maren
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-23 1:48 Van Maren, Kevin
@ 2001-08-23 16:33 ` Andrew Morton
0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-23 16:33 UTC (permalink / raw)
To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'
"Van Maren, Kevin" wrote:
>
> > This (rather hastily tested) patch against 2.4.9 should give O(n)
> > behaviour in write_unlocked_buffers(). Does it help?
>
> Yes, it helps quite a bit. Still not as good as I'd like: I don't
> dare try lots of disks yet :-(
Great, thanks. Aside from the lock contention and stuff, how was the
actual disk throughput affected? You said that the stock kernel
basically stops doing anything, yes?
> Here is the new lockmeter output (8 parallel mkfs processes on 4X
> Lion to different disks, entire mkfs execution time).
>
> Looks like blkdev_put() holds kernel_flag for way too long.
>
It calls fsync_dev().
There are two n^2 loops in buffer.c. There's one on the sync_buffers()
path, which we fixed yesterday. But there's also a potential O(2n) path
in balance_dirty(). So if we're calling mark_buffer_dirty() a lot,
this becomes quadratic. For this to bite us the BUF_DIRTY list would
have to be "almost full" of buffers for another device. There's also
some weird stuff in sync_buffers() - not sure what it's trying to do.
I'll take that up with the boss when he gets back from the polar bear
hunt or whatever it is they do over there.
Here's a different patch which addresses the balance_dirty() thing
as well..
--- linux-2.4.9/fs/buffer.c Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c Wed Aug 22 13:16:22 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct
* return without it!
*/
#define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
int nr;
next = lru_list[BUF_DIRTY];
+ if (start_bh && *start_bh) {
+ if ((*start_bh)->b_list == BUF_DIRTY)
+ next = *start_bh;
+ brelse(*start_bh);
+ *start_bh = NULL;
+ }
nr = nr_buffers_type[BUF_DIRTY] * 2;
count = 0;
while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
if (dev && bh->b_dev != dev)
continue;
- if (test_and_set_bit(BH_Lock, &bh->b_state))
+ if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+ /* Shouldn't be on BUF_DIRTY */
+ __refile_buffer(bh);
continue;
+ }
if (atomic_set_buffer_clean(bh)) {
__refile_buffer(bh);
get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
if (count < NRSYNC)
continue;
+ if (start_bh && next) {
+ get_bh(next);
+ *start_bh = next;
+ }
spin_unlock(&lru_list_lock);
write_locked_buffers(array, count);
return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
*/
static void write_unlocked_buffers(kdev_t dev)
{
+ struct buffer_head *start_bh = NULL;
do {
spin_lock(&lru_list_lock);
- } while (write_some_buffers(dev));
+ } while (write_some_buffers(dev, &start_bh));
+ brelse(start_bh);
run_task_queue(&tq_disk);
}
@@ -1117,13 +1132,15 @@ int balance_dirty_state(kdev_t dev)
void balance_dirty(kdev_t dev)
{
int state = balance_dirty_state(dev);
+ struct buffer_head *start_bh = NULL;
if (state < 0)
return;
/* If we're getting into imbalance, start write-out */
spin_lock(&lru_list_lock);
- write_some_buffers(dev);
+ write_some_buffers(dev, &start_bh);
+ brelse(start_bh);
/*
* And if we're _really_ out of balance, wait for
@@ -2607,7 +2624,7 @@ static int sync_old_buffers(void)
bh = lru_list[BUF_DIRTY];
if (!bh || time_before(jiffies, bh->b_flushtime))
break;
- if (write_some_buffers(NODEV))
+ if (write_some_buffers(NODEV, NULL))
continue;
return 0;
}
@@ -2706,7 +2723,7 @@ int bdflush(void *startup)
CHECK_EMERGENCY_SYNC
spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+ if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
wait_for_some_buffers(NODEV);
interruptible_sleep_on(&bdflush_wait);
}
* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23 1:48 Van Maren, Kevin
2001-08-23 16:33 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23 1:48 UTC (permalink / raw)
To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'
> This (rather hastily tested) patch against 2.4.9 should give O(n)
> behaviour in write_unlocked_buffers(). Does it help?
Yes, it helps quite a bit. Still not as good as I'd like: I don't
dare try lots of disks yet :-(
Here is the new lockmeter output (8 parallel mkfs processes on 4X
Lion to different disks, entire mkfs execution time).
Looks like blkdev_put() holds kernel_flag for way too long.
Looking under the lru_list_lock, you can see several functions that hold
the spinlock > 300ms. getblk only holds it for a max of 7ms, but it gets
called so much it dominates the execution time.
___________________________________________________________________________________________
System: Linux vanmaren.slc.unisys.com 2.4.8 #13 SMP Wed Aug 22 18:34:15 UTC 2001 ia64
Total counts
All (4) CPUs
Selecting locks that meet ALL of the following:
utilization : > 1.00%
Start time: Wed Aug 22 19:10:17 2001
End time: Wed Aug 22 19:11:30 2001
Delta Time: 72.67 sec.
Hash table slots in use: 254.
Global read lock slots in use: 101.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
29.4% 4.8us(2809ms) 40us(1780ms)(62.4%) 15255259 70.6% 29.4% 0.00% *TOTAL*
3.9% 0.06% 29us(5958us) 0us 96244 100% 0% 0.06% global_bh_lock
3.9% 0.06% 29us(5958us) 0us 96244 100% 0% 0.06% bh_action+0x40
6.6% 9.1% 1.3us( 111us) 14us( 234us)( 1.7%) 3773504 90.9% 9.1% 0% io_request_lock
0.00% 10.8% 0.5us( 2.6us) 13us( 68us)(0.00%) 655 89.2% 10.8% 0% __get_request_wait+0xb0
5.4% 3.4% 2.1us( 77us) 9.8us( 234us)(0.21%) 1846091 96.6% 3.4% 0% __make_request+0x270
0.05% 18.6% 2.6us( 111us) 6.2us( 123us)(0.01%) 13228 81.4% 18.6% 0% ahc_linux_isr+0x3d0
0.77% 14.4% 0.3us( 3.8us) 16us( 226us)( 1.4%) 1845765 85.6% 14.4% 0% blk_get_queue+0x50
0.00% 18.6% 3.7us( 19us) 12us( 117us)(0.00%) 575 81.4% 18.6% 0% generic_unplug_device+0x40
0.29% 22.3% 11us( 92us) 11us( 154us)(0.02%) 18488 77.7% 22.3% 0% scsi_dispatch_cmd+0x2b0
0.01% 18.5% 0.4us( 3.3us) 5.8us( 132us)(0.01%) 16157 81.5% 18.5% 0% scsi_finish_command+0x60
0.05% 17.5% 2.1us( 95us) 6.3us( 95us)(0.01%) 16157 82.5% 17.5% 0% scsi_queue_next_request+0x40
0.07% 11.8% 2.9us( 37us) 3.4us( 107us)(0.00%) 16388 88.2% 11.8% 0% scsi_request_fn+0x510
8.0% 0.45% 54us(2809ms) 25us(2394us)(0.00%) 107442 99.5% 0.45% 0% kernel_flag
0.00% 0% 0.4us( 0.5us) 0us 3 100% 0% 0% acct_process+0x140
0.00% 56.2% 5.1us( 12us) 559us(2394us)(0.00%) 16 43.8% 56.2% 0% blkdev_open+0xb0
7.6% 20.0% 553ms(2809ms) 46us( 59us)(0.00%) 10 80.0% 20.0% 0% blkdev_put+0xd0
0.00% 0% 2.3us( 3.3us) 0us 8 100% 0% 0% chrdev_open+0xc0
0.00% 0% 52us( 63us) 0us 3 100% 0% 0% do_exit+0x1f0
0.00% 0% 0.5us( 0.9us) 0us 63 100% 0% 0% ext2_discard_prealloc+0x50
0.00% 14.9% 1.9us( 41us) 14us( 28us)(0.00%) 74 85.1% 14.9% 0% ext2_get_block+0xb0
0.00% 0% 7.0us( 8.2us) 0us 6 100% 0% 0% fsync_dev+0x80
0.00% 0.40% 1.3us( 9.9us) 20us( 57us)(0.00%) 1260 99.6% 0.40% 0% posix_lock_file+0x2d0
0.19% 0.15% 4.9us( 383us) 8.2us( 87us)(0.00%) 27984 99.9% 0.15% 0% real_lookup+0x200
0.00% 27.3% 6.4us( 17us) 15us( 20us)(0.00%) 11 72.7% 27.3% 0% schedule+0x8c0
0.00% 0% 219us(1031us) 0us 10 100% 0% 0% sync_old_buffers+0x50
0.00% 22.2% 5.4us( 11us) 30us( 34us)(0.00%) 9 77.8% 22.2% 0% sys_ioctl+0x90
0.03% 0.54% 0.3us( 176us) 14us( 580us)(0.00%) 73179 99.5% 0.54% 0% sys_lseek+0xf0
0.11% 0.48% 20us( 586us) 34us( 172us)(0.00%) 4165 99.5% 0.48% 0% tty_write+0x4a0
0.02% 0.16% 24us( 64us) 2.4us( 2.4us)(0.00%) 640 99.8% 0.16% 0% vfs_readdir+0x1c0
0.00% 0% 31us( 31us) 0us 1 100% 0% 0% vfs_statfs+0xc0
78.3% 87.2% 12us(1780ms) 43us(1780ms)(60.6%) 4681456 12.8% 87.2% 0% lru_list_lock
4.8% 30.0% 350ms(1780ms) 61us( 77us)(0.00%) 10 70.0% 30.0% 0% __invalidate_buffers+0x40
50.8% 78.5% 15us(7412us) 34us( 366ms)(21.6%) 2382129 21.5% 78.5% 0% getblk+0x40
0.79% 66.7% 64ms( 366ms) 36us( 97us)(0.00%) 9 33.3% 66.7% 0% kupdate+0x190
6.8% 97.0% 2.2us(1204us) 34us( 366ms)(25.2%) 2241541 3.0% 97.0% 0% refile_buffer+0x30
0.80% 90.9% 15us(3212us) 401us(1780ms)( 4.9%) 38766 9.1% 90.9% 0% sync_old_buffers+0x100
3.4% 12.5% 77ms( 327ms) 22us( 54us)(0.00%) 32 87.5% 12.5% 0% wait_for_locked_buffers+0x40
11.0% 18.6% 420us( 344ms) 7344us(1780ms)( 8.9%) 18969 81.4% 18.6% 0% write_unlocked_buffers+0x40
1.3% 0.45% 0.4us(3859us) 2.0us( 180us)(0.01%) 2381262 99.6% 0.45% 0% getblk+0x2b0
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS HOLD MAX RDR BUSY PERIOD WAIT
UTIL CON MEAN RDRS MEAN( MAX ) MEAN( MAX )( %CPU) TOTAL NOWAIT SPIN NAME
0.00% 1.2us( 1.5us)(0.00%) 229513 100% 0.00% *TOTAL*
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES HOLD WAIT (ALL) WAIT (WW)
UTIL CON MEAN( MAX ) MEAN( MAX )( %CPU) MEAN( MAX ) TOTAL NOWAIT SPIN( WW ) NAME
36.6% 0% 9.0us(3880us) 0us 0us 2942425 100% 0%( 0%) hash_table_lock
1.1% 0% 1.4us( 6.5us) 0us 0us 560296 100% 0%( 0%) __invalidate_buffers+0x250
35.5% 0% 11us(3880us) 0us 0us 2382129 100% 0%( 0%) getblk+0x60
_________________________________________________________________________________________________________________________
Number of read locks found=0
* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-22 22:23 Van Maren, Kevin
0 siblings, 0 replies; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-22 22:23 UTC (permalink / raw)
To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'
Thanks Andrew! I'll try out your patch shortly, and I'll let you
know how it goes.
> Note how fsync_dev() passes the target device to sync_buffers(). But
> the dirty buffer list is global. So to write out the dirty buffers
> for a particular device, write_locked_buffers() has to do a linear
> walk of the dirty buffers for *other* devices to find the target
> device.
I had just figured that out. I hadn't changed the code yet, though.
The other thing that exacerbates the start-from-0 approach is only
flushing 32 blocks at a time. Especially if we take a long time to
find the first one because of the linear search, not doing much
work once we get there is also a problem.
> And write_unlocked_buffers() uses a quite common construct - it
> scans a list but when it drops the lock, it restarts the scan
> from the start of the list. (We do this all over the kernel, and
> it keeps on biting us).
It sounds like we want a per-device list, not a global (linear) one.
But I'm sure there are other places in the kernel where the per-
device list would introduce problems (but would make things like
sync() better, since we could flush all devices more in parallel).
But there are probably also problems there when scaling to 100's or
1000's of disks.
> So if the dirty buffer list has 10,000 buffers for device A and
> then 10,000 buffers for device B, and you call fsync_dev(B),
> we end up traversing the 10,000 buffers of device A 10,000/32 times,
> which is a lot.
>
> In fact, write_unlocked_buffers(A) shoots itself in the foot by
> moving buffers for device A onto BUF_LOCKED, and then restarting the
> scan. So of *course* we end up with zillions on non-A buffers at the
> head of the list.
Is it safe to restart from the middle of the list after dropping
the lock? It looks like you try to solve that problem by only using
the hint if the block is still on the dirty list.
I was going to cache that per-device info globally as a quick hack;
I was thinking about hanging it off the kdev_t, since I didn't want
a fixed-size array in write_some_buffers(), but your approach of just
having the caller track it works too.
I'm off to try the patch...
But before I go, here is another profile run, but without your patch, to validate your hypothesis (I seem to have lost several interrupts).
[ia64_spinlock_contention is time spent spinning on a spinlock (you know which one); myskipbuffer1 is called each time it "skips" a buffer in the LRU list (no dev match).
Note that in this case, 26000 elements are skipped on the "average" call to write_some_buffers.]
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
50.36 40.43 40.43 ia64_spinlock_contention
23.98 59.68 19.25 2658 7.24 8.17 write_some_buffers
17.10 73.40 13.72 cg_record_arc
4.48 77.00 3.59 mcount
1.36 78.09 1.10 85039 0.01 0.01 _make_request
0.82 78.75 0.66 69385716 0.00 0.00 myskipbuffer1
0.77 79.37 0.62 85039 0.01 0.01 blk_get_queue
index % time self children called name
<spontaneous>
[1] 52.7 40.43 0.00 ia64_spinlock_contention [1]
-----------------------------------------------
2.68 0.34 370/2658 sync_old_buffers [11]
16.57 2.12 2288/2658 write_unlocked_buffers [6]
[2] 28.3 19.25 2.47 2658 write_some_buffers [2]
0.00 1.77 2658/2658 write_locked_buffers [12]
0.66 0.00 69385716/69385716 myskipbuffer1 [16]
0.01 0.01 85056/123329 _refile_buffer [54]
0.01 0.00 85056/130255 _insert_into_lru_list [70]
0.01 0.00 85056/85056 myatomic_set_buffer_clean [93]
0.00 0.00 85056/85056 my__refile_buffer [104]
0.00 0.00 85056/85056 mytest_and_set_bit [113]
0.00 0.00 2658/418442 spin_unlock_ [97]
-----------------------------------------------
Kevin Van Maren
* Re: The cause of the "VM" performance problem with 2.4.X
2001-08-22 5:31 Van Maren, Kevin
@ 2001-08-22 20:19 ` Andrew Morton
0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-22 20:19 UTC (permalink / raw)
To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'
"Van Maren, Kevin" wrote:
>
> ...
>
> I've been running Linux on IA64 (4 CPU LION, 8GB RAM). 2.4.4+IA64 patches through
> 2.4.8+IA64 patches all exhibit "horrific" I/O behavior [disks are basically inactive,
> with occasional flickers, but the CPUs are pegged at 100% system time] when writing
> to multiple disks using multiple CPUs. The easiest way for me to reproduce the
> problem is to run parallel "mkfs" processes (I use 4 SCSI disks).
>
> First thing to do is to profile the kernel, to see why all 4 of my fast IA64
> processors are pegged at 99%+ in the kernel (and see what they are doing).
> So I get the kernel profile patches from SGI (http://oss.sgi.com/projects/kernprof/)
> and patch my kernel. Profile 30 seconds during the "mkfs" process on 4 disks
> (plus a "sync" part way through for kicks). Below is the "interesting" part
> of the output (truncated for brevity):
Note how fsync_dev() passes the target device to sync_buffers(). But
the dirty buffer list is global. So to write out the dirty buffers
for a particular device, write_locked_buffers() has to do a linear
walk of the dirty buffers for *other* devices to find the target
device.
And write_unlocked_buffers() uses a quite common construct - it
scans a list but when it drops the lock, it restarts the scan
from the start of the list. (We do this all over the kernel, and
it keeps on biting us).
So if the dirty buffer list has 10,000 buffers for device A and
then 10,000 buffers for device B, and you call fsync_dev(B),
we end up traversing the 10,000 buffers of device A 10,000/32 times,
which is a lot.
In fact, write_unlocked_buffers(A) shoots itself in the foot by
moving buffers for device A onto BUF_LOCKED, and then restarting the
scan. So of *course* we end up with zillions on non-A buffers at the
head of the list.
fsync_dev() and balance_dirty() are the culprits in this scenario - I'd
be surprised if sys_sync() displayed similar quadratic behaviour. (Well, it
would do so if there were a lot of locked buffers on BUF_DIRTY, but there
usually aren't).
This (rather hastily tested) patch against 2.4.9 should give O(n)
behaviour in write_unlocked_buffers(). Does it help?
--- linux-2.4.9/fs/buffer.c Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c Wed Aug 22 13:16:22 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct
* return without it!
*/
#define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
int nr;
next = lru_list[BUF_DIRTY];
+ if (start_bh && *start_bh) {
+ if ((*start_bh)->b_list == BUF_DIRTY)
+ next = *start_bh;
+ brelse(*start_bh);
+ *start_bh = NULL;
+ }
nr = nr_buffers_type[BUF_DIRTY] * 2;
count = 0;
while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
if (dev && bh->b_dev != dev)
continue;
- if (test_and_set_bit(BH_Lock, &bh->b_state))
+ if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+ /* Shouldn't be on BUF_DIRTY */
+ __refile_buffer(bh);
continue;
+ }
if (atomic_set_buffer_clean(bh)) {
__refile_buffer(bh);
get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
if (count < NRSYNC)
continue;
+ if (start_bh && next) {
+ get_bh(next);
+ *start_bh = next;
+ }
spin_unlock(&lru_list_lock);
write_locked_buffers(array, count);
return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
*/
static void write_unlocked_buffers(kdev_t dev)
{
+ struct buffer_head *start_bh = NULL;
do {
spin_lock(&lru_list_lock);
- } while (write_some_buffers(dev));
+ } while (write_some_buffers(dev, &start_bh));
+ brelse(start_bh);
run_task_queue(&tq_disk);
}
@@ -1117,13 +1132,15 @@ int balance_dirty_state(kdev_t dev)
void balance_dirty(kdev_t dev)
{
int state = balance_dirty_state(dev);
+ struct buffer_head *start_bh = NULL;
if (state < 0)
return;
/* If we're getting into imbalance, start write-out */
spin_lock(&lru_list_lock);
- write_some_buffers(dev);
+ write_some_buffers(dev, &start_bh);
+ brelse(start_bh);
/*
* And if we're _really_ out of balance, wait for
@@ -2607,7 +2624,7 @@ static int sync_old_buffers(void)
bh = lru_list[BUF_DIRTY];
if (!bh || time_before(jiffies, bh->b_flushtime))
break;
- if (write_some_buffers(NODEV))
+ if (write_some_buffers(NODEV, NULL))
continue;
return 0;
}
@@ -2706,7 +2723,7 @@ int bdflush(void *startup)
CHECK_EMERGENCY_SYNC
spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+ if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
wait_for_some_buffers(NODEV);
interruptible_sleep_on(&bdflush_wait);
}
* The cause of the "VM" performance problem with 2.4.X
@ 2001-08-22 5:31 Van Maren, Kevin
2001-08-22 20:19 ` Andrew Morton
0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-22 5:31 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org'
There has been quite a bit of discussion on VM-related problems on the list (but I haven't been able to follow all of it); hopefully I can add some clarity to the problem. Unfortunately, I haven't fixed it yet (give me some time... or fix it and send me a patch ;-)
I've been running Linux on IA64 (4 CPU LION, 8GB RAM). 2.4.4+IA64 patches through 2.4.8+IA64 patches all exhibit "horrific" I/O behavior [disks are basically inactive, with occasional flickers, but the CPUs are pegged at 100% system time] when writing to multiple disks using multiple CPUs. The easiest way for me to reproduce the problem is to run parallel "mkfs" processes (I use 4 SCSI disks).
First thing to do is to profile the kernel, to see why all 4 of my fast IA64 processors are pegged at 99%+ in the kernel (and see what they are doing). So I get the kernel profile patches from SGI (http://oss.sgi.com/projects/kernprof/) and patch my kernel. Profile 30 seconds during the "mkfs" process on 4 disks (plus a "sync" part way through for kicks). Below is the "interesting" part of the output (truncated for brevity):
Flat profile:
Each sample counts as 0.000976562 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
57.49 69.66 69.66 4 17414.79 25262.34 write_unlocked_buffers
22.92 97.43 27.78 15115 1.84 2.39 write_some_buffers
4.77 103.22 5.78 4 1445.31 2624.65 sync_old_buffers
3.64 107.62 4.41 483617 0.01 0.01 _make_request
3.30 111.62 4.00 cg_record_arc
2.92 115.17 3.54 483618 0.01 0.01 blk_get_queue
1.62 117.13 1.96 2975 0.66 0.66 getblk
1.55 119.00 1.88 2910 0.65 0.65 refile_buffer
[. . .]
index % time self children called name
<spontaneous>
[1] 86.9 0.00 105.12 ia64_ret_from_syscall [1]
0.00 75.79 3/3 sys_fsync [5]
0.00 25.26 1/1 sys_sync [7]
0.00 3.77 1195/1195 sys_write [15]
0.00 0.14 2306/2306 sys_read [38]
0.00 0.07 2088/2088 sys_open [43]
[. . .]
-----------------------------------------------
0.00 25.26 1/4 sys_sync [7]
0.00 75.79 3/4 sys_fsync [5]
[2] 83.5 0.00 101.05 4 fsync_dev [2]
0.00 101.05 4/4 sync_buffers [3]
-----------------------------------------------
0.00 101.05 4/4 fsync_dev [2]
[3] 83.5 0.00 101.05 4 sync_buffers [3]
69.66 31.39 4/4 write_unlocked_buffers [4]
-----------------------------------------------
69.66 31.39 4/4 sync_buffers [3]
[4] 83.5 69.66 31.39 4 write_unlocked_buffers [4]
24.18 7.21 13157/15115 write_some_buffers [6]
-----------------------------------------------
0.00 75.79 3/3 ia64_ret_from_syscall [1]
[5] 62.7 0.00 75.79 3 sys_fsync [5]
0.00 75.79 3/4 fsync_dev [2]
0.00 0.00 3/5429 fget [354]
0.00 0.00 3/38 filemap_fdatasync [479]
0.00 0.00 3/3 block_fsync [577]
-----------------------------------------------
3.60 1.07 1958/15115 sync_old_buffers [9]
24.18 7.21 13157/15115 write_unlocked_buffers [4]
[6] 29.8 27.78 8.29 15115 write_some_buffers [6]
0.01 8.16 15115/15115 write_locked_buffers [10]
0.04 0.04 483680/567312 _refile_buffer [41]
0.04 0.00 483680/570200 _insert_into_lru_list [57]
-----------------------------------------------
0.00 25.26 1/1 ia64_ret_from_syscall [1]
[7] 20.9 0.00 25.26 1 sys_sync [7]
0.00 25.26 1/4 fsync_dev [2]
-----------------------------------------------
<spontaneous>
[8] 8.7 0.01 10.52 kupdate [8]
5.78 4.72 4/4 sync_old_buffers [9]
0.00 0.02 3/3 wait_for_buffers [70]
0.00 0.00 3/115 schedule_timeout [123]
-----------------------------------------------
5.78 4.72 4/4 kupdate [8]
[9] 8.7 5.78 4.72 4 sync_old_buffers [9]
3.60 1.07 1958/15115 write_some_buffers [6]
0.00 0.05 4/4 sync_unlocked_inodes [53]
0.00 0.00 4/4 sync_supers [575]
-----------------------------------------------
0.01 8.16 15115/15115 write_some_buffers [6]
[10] 6.8 0.01 8.16 15115 write_locked_buffers [10]
0.14 8.02 483617/483618 submit_bh [11]
-----------------------------------------------
0.00 0.00 1/483618 ll_rw_block [269]
0.14 8.02 483617/483618 write_locked_buffers [10]
[11] 6.7 0.14 8.02 483618 submit_bh [11]
0.06 7.96 483618/483618 generic_make_request [12]
-----------------------------------------------
0.06 7.96 483618/483618 submit_bh [11]
[12] 6.6 0.06 7.96 483618 generic_make_request [12]
4.41 0.01 483617/483617 _make_request [13]
3.54 0.00 483618/483618 blk_get_queue [17]
-----------------------------------------------
Okay, that was helpful. But Yikes! Why would write_unlocked_buffers take so long? It MUST be a locking problem: all it does is acquire a lock and call write_some_buffers().
That's fine, I'll profile the locks using SGI's handy-dandy lockmeter lock profiling code (http://oss.sgi.com/projects/lockmeter/). Hack it up to apply to my hacked 2.4.8 kernel, and away we go. Run over 30 seconds of 4*mkfs. Goes noticeably faster than when I was profiling the whole kernel...
Dumping all locks that have > 1% "utilization" gives:
___________________________________________________________________________________________
System: Linux vanmaren.slc.unisys.com 2.4.8 #5 SMP Tue Aug 21 21:27:10 UTC 2001 ia64
Total counts
All (4) CPUs
Selecting locks that meet ALL of the following:
utilization : > 1.00%
Start time: Tue Aug 21 22:05:45 2001
End time: Tue Aug 21 22:06:16 2001
Delta Time: 30.19 sec.
Hash table slots in use: 228.
Global read lock slots in use: 71.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS HOLD WAIT
UTIL CON MEAN( MAX ) MEAN( MAX )(% CPU) TOTAL NOWAIT SPIN RJECT NAME
26.9% 24us( 49ms) 241us( 59ms)(71.4%) 1338023 73.1% 26.8% 0.09% *TOTAL*
2.4% 3.3% 19us(2035us) 0us 37909 96.7% 0% 3.3% global_bh_lock
2.4% 3.3% 19us(2035us) 0us 37909 96.7% 0% 3.3% bh_action+0x40
5.2% 32.1% 1.5us( 69us) 9.4us( 228us)( 2.7%) 1072409 67.9% 32.1% 0% io_request_lock
0.00% 59.9% 0.9us( 2.7us) 13us( 115us)(0.00%) 322 40.1% 59.9% 0% __get_request_wait+0xb0
4.0% 19.9% 2.3us( 69us) 4.9us( 196us)(0.42%) 525687 80.1% 19.9% 0% __make_request+0x270
0.03% 18.5% 2.3us( 10us) 12us( 105us)(0.01%) 3531 81.5% 18.5% 0% ahc_linux_isr+0x3d0
0.92% 44.9% 0.5us( 3.9us) 11us( 228us)( 2.2%) 525536 55.1% 44.9% 0% blk_get_queue+0x50
0.00% 16.5% 2.6us( 18us) 7.7us( 42us)(0.00%) 243 83.5% 16.5% 0% generic_unplug_device+0x40
0.20% 21.3% 14us( 43us) 13us( 111us)(0.01%) 4399 78.7% 21.3% 0% scsi_dispatch_cmd+0x2b0
0.01% 16.0% 0.5us( 3.5us) 14us( 166us)(0.01%) 4146 84.0% 16.0% 0% scsi_finish_command+0x60
0.03% 16.8% 1.8us( 44us) 13us( 104us)(0.01%) 4146 83.2% 16.8% 0% scsi_queue_next_request+0x40
0.04% 6.4% 2.9us( 26us) 4.6us( 72us)(0.00%) 4399 93.6% 6.4% 0% scsi_request_fn+0x510
97.0% 60.3% 1253us( 49ms) 5886us( 59ms)(68.7%) 23383 39.7% 60.3% 0% lru_list_lock
0.00% 0% 1.0us( 1.0us) 0us 1 100% 0% 0% buffer_insert_inode_queue+0x30
0.15% 11.4% 13us( 22us) 4396us( 30ms)( 1.5%) 3503 88.6% 11.4% 0% getblk+0x40
0.33% 33.3% 33ms( 49ms) 131us( 131us)(0.00%) 3 66.7% 33.3% 0% kupdate+0x190
0.00% 14.6% 0.4us( 2.8us) 3865us( 30ms)( 1.6%) 3447 85.4% 14.6% 0% refile_buffer+0x30
0.05% 60.8% 15us( 26us) 1256us(8562us)(0.66%) 1037 39.2% 60.8% 0% sync_old_buffers+0x100
0.00% 0% 6.6us( 11us) 0us 2 100% 0% 0% try_to_free_buffers+0x90
96.5% 81.6% 1893us(9616us) 6247us( 59ms)(65.0%) 15390 18.4% 81.6% 0% write_unlocked_buffers+0x40
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS HOLD MAX RDR BUSY PERIOD WAIT
UTIL CON MEAN RDRS MEAN( MAX ) MEAN( MAX )( %CPU) TOTAL NOWAIT SPIN NAME
0.01% 1.2us( 1.2us)(0.00%) 14867 100% 0.01% *TOTAL*
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES HOLD WAIT (ALL) WAIT (WW)
UTIL CON MEAN( MAX ) MEAN( MAX )( %CPU) MEAN( MAX ) TOTAL NOWAIT SPIN( WW ) NAME
0% 0.8us( 39us) 0us 0us 72355 100% 0%( 0%) *TOTAL*
_________________________________________________________________________________________________________________________
Number of read locks found=0
Yep, lru_list_lock is the problem, and it is exacerbated by write_unlocked_buffers holding it for SO long.
The code in question (2.4.8 that I was testing with) is:
static void write_unlocked_buffers(kdev_t dev)
{
do {
spin_lock(&lru_list_lock);
} while (write_some_buffers(dev));
run_task_queue(&tq_disk);
}
/*
* Write some buffers from the head of the dirty queue.
*
* This must be called with the LRU lock held, and will
* return without it!
*/
#define NRSYNC (32)
static int write_some_buffers(kdev_t dev)
{
struct buffer_head *next;
struct buffer_head *array[NRSYNC];
unsigned int count;
int nr;
next = lru_list[BUF_DIRTY];
nr = nr_buffers_type[BUF_DIRTY] * 2;
count = 0;
while (next && --nr >= 0) {
struct buffer_head * bh = next;
next = bh->b_next_free;
if (dev && bh->b_dev != dev)
continue;
if (test_and_set_bit(BH_Lock, &bh->b_state))
continue;
if (atomic_set_buffer_clean(bh)) {
__refile_buffer(bh);
get_bh(bh);
array[count++] = bh;
if (count < NRSYNC)
continue;
spin_unlock(&lru_list_lock);
write_locked_buffers(array, count);
return -EAGAIN;
}
unlock_buffer(bh);
__refile_buffer(bh);
}
spin_unlock(&lru_list_lock);
if (count)
write_locked_buffers(array, count);
return 0;
}
It looks like more investigation is in order, but it appears that the lru_list_lock is held for too long, and that write_some_buffers() is a good candidate for SMP improvement. Tomorrow, I'll track down where the time is going in write_some_buffers().
Kevin Van Maren
end of thread, other threads:[~2001-08-29 13:50 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-08-28 17:35 The cause of the "VM" performance problem with 2.4.X Van Maren, Kevin
2001-08-28 18:52 ` Linus Torvalds
2001-08-28 19:29 ` André Dahlqvist
2001-08-29 13:49 ` Rik van Riel
2001-08-29 8:22 ` Jens Axboe
2001-08-29 8:25 ` Jens Axboe
-- strict thread matches above, loose matches on Subject: below --
2001-08-23 17:26 Van Maren, Kevin
2001-08-23 17:06 Van Maren, Kevin
2001-08-23 17:18 ` Andrew Morton
2001-08-23 1:48 Van Maren, Kevin
2001-08-23 16:33 ` Andrew Morton
2001-08-22 22:23 Van Maren, Kevin
2001-08-22 5:31 Van Maren, Kevin
2001-08-22 20:19 ` Andrew Morton