linux-kernel.vger.kernel.org archive mirror
* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-28 17:35 Van Maren, Kevin
  2001-08-28 18:52 ` Linus Torvalds
  0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-28 17:35 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'

Okay, here are some more results with Morton's "2nd" patch,
running 2.4.8, IA64-010815 on 4X IA64 (Lion) with 8GB RAM,
733MHz B3 (4MB) CPUs.  3 Adaptec 39160 channels, each with
8 (18GB) Cheetah drives, create 4 md's using 2 disks from
each channel.  Also using Gibbs' 6.2.0 adaptec driver.
I tried to extract the "important" info to keep this message
relatively small.  Meant to get this out earlier...

It looks like "getblk" could use some additional optimization.
More info to follow, but even doing one "mkfs" at a time,
the lru_list_lock is held over 82% of the time, mostly by
getblk().


SMP test case: 4 parallel mkfs (each on 6-disk md):
===================================================
  "start"
  mkfs /dev/md0 &
  mkfs /dev/md1 &
  mkfs /dev/md2 &
  mkfs /dev/md3
  "stop"

Abbreviated/stripped kernprof/gprof output:
------------------------------------------

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 39.46    224.65   224.65                             cg_record_arc
 16.40    318.00    93.34  6722992     0.01     0.02  getblk
  9.02    369.33    51.33 50673121     0.00     0.00  spin_lock_
  6.67    407.27    37.95  6722669     0.01     0.01  _make_request
  4.51    432.97    25.70 13445261     0.00     0.00  blk_get_queue
  2.61    447.83    14.86                             long_copy_user
  2.59    462.56    14.72                             mcount
  2.06    474.27    11.71                             cpu_idle
  1.35    481.97     7.70 26887936     0.00     0.00  _insert_into_lru_list
  1.23    489.00     7.02 12463048     0.00     0.00  write_lock_
  1.08    495.15     6.15 12463047     0.00     0.00  write_unlock_
  0.94    500.52     5.37        7   767.72  2342.72  _invalidate_buffers
  0.89    505.56     5.04    72518     0.07     0.08  scsi_dispatch_cmd
  0.83    510.26     4.70   216716     0.02     0.65  block_write
  0.77    514.63     4.37 20169809     0.00     0.00  _refile_buffer
  0.74    518.82     4.19    68098     0.06     0.08  scsi_init_io_vc
  0.73    522.96     4.15    58177     0.07     0.34  _scsi_end_request
  0.64    526.60     3.64 25304057     0.00     0.00  _remove_from_lru_list
  0.59    529.97     3.36      718     4.68     4.68  si_swapinfo
  0.56    533.17     3.21  5644224     0.00     0.00  end_buffer_io_sync
  0.40    535.46     2.29    43581     0.05     0.06  ahc_linux_isr
  0.36    537.50     2.04   210123     0.01     0.45  write_some_buffers
  0.35    539.52     2.02  6729267     0.00     0.00  _remove_from_free_list
  0.34    541.46     1.94 13445412     0.00     0.00  balance_dirty_state
[misc "interesting" others]
  0.09    557.35     0.49      116     4.22    41.74  wait_for_buffers
  0.06    560.37     0.33    58177     0.01     0.35  scsi_finish_command
  0.02    565.67     0.14       54     2.57  1655.62  sync_old_buffers
  0.01    567.60     0.04       42     1.00    18.82  page_launder
  0.00    569.24     0.00       43     0.05   146.48  write_unlocked_buffers
  0.00    569.29     0.00       42     0.00    19.44  do_try_to_free_pages
  0.00    569.29     0.00       29     0.00   306.44  sync_buffers
  0.00    569.29    0.00       28     0.00    92.44  wait_for_locked_buffers
  0.00    569.29     0.00       15     0.00   306.48  fsync_dev
  0.00    569.29     0.00        8     0.00   574.65  sys_fsync
  0.00    569.29     0.00        7     0.00  2955.64  blkdev_put

index % time    self  children    called     name
[1]     40.5  224.65    0.00                 cg_record_arc [1]

[2]     32.3    0.02  179.06                 ia64_ret_from_syscall [2]
                0.10  150.24  232733/232733      sys_write [3]

                0.10  150.24  232733/232733      ia64_ret_from_syscall [2]
[3]     27.1    0.10  150.24  232733         sys_write [3]
                4.70  136.37  216716/216716      block_write [4]

                4.70  136.37  216716/216716      sys_write [3]
[4]     25.4    4.70  136.37  216716         block_write [4]
               93.34   26.66 6722748/6722992     getblk [5]
                1.16   11.57 6722748/6723007     mark_buffer_dirty [30]
                0.32    2.90 13445336/13445412     balance_dirty [54]
                0.40    0.00 6722748/6736794     _brelse [112]

                0.00    0.00      32/6722992     block_read [391]
                0.00    0.00     212/6722992     bread [298]
               93.34   26.66 6722748/6722992     block_write [4]
[5]     21.6   93.34   26.66 6722992         getblk [5]
               13.67    0.00 13492508/50673121     spin_lock_ [12]
                3.80    0.00 6747334/12463048     write_lock_ [40]
                3.33    0.00 6747334/12463047     write_unlock_ [42]
                2.02    0.00 6720832/6729267     _remove_from_free_list [69]
                1.93    0.00 6720832/26887936     _insert_into_lru_list [37]
                0.00    0.76   24342/24342       refill_freelist [90]
                0.74    0.00 6720832/6720832     _insert_into_queues [92]
                0.26    0.00 13492508/51086468     spin_unlock_ [78]
                0.16    0.00 6720832/6720832     init_buffer [144]

                0.13    6.15   13845/210123      write_unlocked_buffers [41]
                1.91   87.15  196278/210123      sync_old_buffers [8]
[6]     17.2    2.04   93.30  210123         write_some_buffers [6]
                0.13   88.77  210087/210087      write_locked_buffers [9]
                1.46    0.97 6722667/20169809     _refile_buffer [39]
                1.93    0.00 6722665/26887936     _insert_into_lru_list [37]

[7]     16.5    0.00   91.66                 kupdate [7]
                0.14   89.26      54/54          sync_old_buffers [8]
                0.23    2.03      54/116         wait_for_buffers [46]

                0.14   89.26      54/54          kupdate [7]
[8]     16.1    0.14   89.26      54         sync_old_buffers [8]
                1.91   87.15  196278/210123      write_some_buffers [6]
                0.20    0.00  196385/50673121     spin_lock_ [12]

                0.13   88.77  210087/210087      write_some_buffers [6]
[9]     16.0    0.13   88.77  210087         write_locked_buffers [9]
                1.10   87.67 6722665/6722669     submit_bh [10]

                0.00    0.00       4/6722669     ll_rw_block [478]
                1.10   87.67 6722665/6722669     write_locked_buffers [9]
[10]    16.0    1.10   87.67 6722669         submit_bh [10]
                0.97   86.70 6722669/6722669     generic_make_request [11]

                0.97   86.70 6722669/6722669     submit_bh [10]
[11]    15.8    0.97   86.70 6722669         generic_make_request [11]
               37.95    6.98 6722669/6722669     _make_request [13]
               25.70   13.88 13445261/13445261     blk_get_queue [14]
                0.62    1.58 6722592/6722592     md_make_request [67]

                0.11    0.00  108665/50673121     _wake_up [113]
                0.16    0.00  156642/50673121     real_lookup [71]
                0.16    0.00  156651/50673121     d_rehash [143]
                0.16    0.00  156653/50673121     d_alloc [138]
                0.16    0.00  156656/50673121     d_instantiate [142]
                0.16    0.00  156661/50673121     get_empty_inode [136]
                0.20    0.00  196385/50673121     sync_old_buffers [8]
                0.22    0.00  218577/50673121     sys_lseek [38]
                0.37    0.00  360440/50673121     timer_bh [87]
                0.39    0.00  389480/50673121     get_unused_buffer_head [103]
                0.41    0.00  405522/50673121     d_lookup [105]
                0.81    0.00  800334/50673121     atomic_dec_and_lock [83]
                4.81    0.00 4749700/50673121     put_last_free [45]
                6.81    0.00 6722665/50673121     refile_buffer [34]
                6.81    0.00 6722816/50673121     _make_request [13]
               13.62    0.00 13445261/50673121     blk_get_queue [14]
               13.67    0.00 13492508/50673121     getblk [5]
[12]     9.3   51.33    0.00 50673121         spin_lock_ [12]

               37.95    6.98 6722669/6722669     generic_make_request [11]
[13]     8.1   37.95    6.98 6722669         _make_request [13]
                6.81    0.00 6722816/50673121     spin_lock_ [12]
                0.13    0.00 6722816/51086468     spin_unlock_ [78]
                0.00    0.03     147/147         _get_request_wait [187]

               25.70   13.88 13445261/13445261     generic_make_request [11]
[14]     7.1   25.70   13.88 13445261         blk_get_queue [14]
               13.62    0.00 13445261/50673121     spin_lock_ [12]
                0.26    0.00 13445261/51086468     spin_unlock_ [78]

[15]     4.5    0.05   24.67                 ia64_leave_kernel [15]
                0.00   24.28 1479060/1479060     ia64_handle_irq [17]
                0.31    0.05   26208/30536       schedule [111]
                0.00    0.03    3629/3629        ia64_do_page_fault [186]

                0.00   24.28 1492630/1492630     ia64_handle_irq [17]
[16]     4.4    0.00   24.28 1492630         do_IRQ [16]
                0.13   20.78  403703/444995      do_softirq [18]
                0.39    2.89 1492630/1492630     handle_IRQ_event [53]


Lockmeter output (stripped -- holds lock for 1ms+, or 3%+):
-----------------------------------------------------------
System: Linux vanmaren.slc.unisys.com 2.4.8 #23 SMP Fri Aug 24 22:10:04 UTC 2001 ia64
Total counts

All (4) CPUs
Selecting locks that meet ALL of the following:
        utilization :   >  1.00%


Start time: Fri Aug 24 22:23:05 2001
End   time: Fri Aug 24 22:28:43 2001
Delta Time: 337.74 sec.
Hash table slots in use:      388.
Global read lock slots in use: 245.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN
RJECT  NAME

       25.1%  6.1us(5450ms)   59us(5817ms)(65.4%)  60013220 74.9% 25.1%
0.00%  *TOTAL*

  3.1% 0.06%   25us(  14ms)    0us                   414766  100%    0%
0.06%  global_bh_lock
  3.1% 0.06%   25us(  14ms)    0us                   414766  100%    0%
0.06%    bh_action+0x60

  6.6%  3.9%  1.1us( 201us)   11us( 223us)(0.63%)  20495246 96.1%  3.9%
0%  io_request_lock

  3.1% 0.44%   28us(5450ms)   16ms(5817ms)( 2.0%)    381041 99.6% 0.44%
0%  kernel_flag
  2.7% 57.1% 1280ms(5450ms)   96us( 183us)(0.00%)         7 42.9% 57.1%
0%    blkdev_put+0xf0
 0.26% 0.19%  6.5us(2393us)   33ms(5450ms)(0.65%)    136598 99.8% 0.19%
0%    real_lookup+0x220
 0.12% 0.56%   27us(1026us)   28us( 709us)(0.00%)     15118 99.4% 0.56%
0%    tty_write+0x4c0
 0.03% 0.38%   29us(5251us)  485ms(5817ms)(0.43%)      3153 99.6% 0.38%
0%    vfs_readdir+0x1e0

 87.8% 95.5%   21us(5450ms)   62us(5450ms)(62.1%)  14241646  4.5% 95.5%
0%  lru_list_lock
  2.7% 14.3% 1280ms(5450ms)  2.6us( 2.6us)(0.00%)         7 85.7% 14.3%
0%    __invalidate_buffers+0x60
 0.03%  100% 4320us(  43ms)   45us( 102us)(0.00%)        20    0%  100%
0%    bdflush+0x1d0
 78.1% 96.9%   37us(  12ms)   62us( 256ms)(31.6%)   7143220  3.1% 96.9%
0%    getblk+0x60
  1.3% 63.0%   80ms( 261ms)   65us( 436us)(0.00%)        54 37.0% 63.0%
0%    kupdate+0x1b0
  4.1% 94.5%  2.0us(2945us)   61us(5450ms)(28.6%)   6725777  5.5% 94.5%
0%    refile_buffer+0x50
  1.0% 95.9%   18us(6871us)   66us(  12ms)(0.92%)    195148  4.1% 95.9%
0%    sync_old_buffers+0x120
 0.47% 85.3%  9.8us(1407us)   81us(  26ms)(0.83%)    162013 14.7% 85.3%
0%    try_to_free_buffers+0xb0
 0.03%  8.7% 1102us(  55ms)   52ms( 199ms)(0.03%)        92 91.3%  8.7%
0%    wait_for_locked_buffers+0x60
 0.19% 27.6%   42us(  20ms)  324us( 253ms)(0.10%)     14915 72.4% 27.6%
0%    write_unlocked_buffers+0x60

  2.1%  9.1%  1.0us(3507us)   11us(  10ms)(0.51%)   6745773 90.9%  9.1%
0%  unused_list_lock
  2.0%  9.1%  1.0us(3507us)   11us(  10ms)(0.50%)   6720952 90.9%  9.1%
0%    get_unused_buffer_head+0x50

  1.0% 0.29%  0.5us(1386us)  2.0us( 712us)(0.00%)   7140899 99.7% 0.29%
0%  getblk+0x2d0

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL
NOWAIT SPIN  NAME

       0.00%                                20us(  45us)(0.00%)    847741
100% 0.00%  *TOTAL*

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW)
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL
NOWAIT SPIN(  WW )  NAME

       0.00%   18us(  12ms)  6.1us(  52us)(0.00%)  1.9us( 2.3us)  12941611
100% 0.00%(0.00%)  *TOTAL*

 69.2% 0.00%   19us(  12ms)   11us(  52us)(0.00%)  1.9us( 2.3us)  12058253
100% 0.00%(0.00%)  hash_table_lock
  1.8%    0%  1.3us(  49us)    0us                   0us           4753020
100%    0%(   0%)    __invalidate_buffers+0x270
 67.2% 0.00%   32us(  12ms)   11us(  52us)(0.00%)  1.9us( 2.3us)   7143220
100% 0.00%(0.00%)    getblk+0x80
 0.24%    0%  4.9us(1402us)    0us                   0us            162013
100%    0%(   0%)    try_to_free_buffers+0xd0


=========================================================
Single CPU/FS test case: 1 mkfs (md of the 4 md's above):
Note that the CPUs are only 16% "idle" on a 4-way box.
=========================================================

Kgprof:
-------
Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 28.88    158.93   158.93                             cg_record_arc
 15.89    246.38    87.44                             cpu_idle
 14.82    327.95    81.58  6723015     0.01     0.02  getblk
 14.42    407.30    79.35 59278971     0.00     0.00  spin_lock_
  5.23    436.10    28.80 20168512     0.00     0.00  blk_get_queue
  4.60    461.44    25.34  6722884     0.00     0.01  _make_request
  2.56    475.53    14.08                             mcount
  2.21    487.71    12.18                             long_copy_user
  0.91    492.72     5.01    87690     0.06     0.06  scsi_dispatch_cmd
  0.78    496.99     4.27    83855     0.05     0.27  _scsi_end_request
  0.75    501.11     4.12    87690     0.05     0.08  scsi_init_io_vc
  0.68    504.85     3.74        2  1870.61  7179.64  _invalidate_buffers
  0.65    508.43     3.57  6398252     0.00     0.00  end_buffer_io_sync
  0.51    511.25     2.82    56707     0.05     0.06  ahc_linux_isr
  0.49    513.97     2.72      666     4.09     4.09  si_swapinfo
  0.44    516.40     2.43    82692     0.03     0.03  scsi_malloc
  0.32    518.17     1.78 26887713     0.00     0.00  _insert_into_lru_list
[other interesting points]
  0.24    522.20     1.33   216677     0.01     0.55  block_write
  0.21    527.01     1.16    87102     0.01     0.16  scsi_request_fn
  0.10    537.63     0.54       47    11.55    31.51  page_launder
  0.04    544.66     0.24       98     2.43    15.73  wait_for_buffers
  0.01    548.49     0.05       47     0.98  1887.26  sync_old_buffers
  0.00    550.42     0.00       47     0.00    32.21  do_try_to_free_pages
  0.00    550.42     0.00       12     0.00   462.33  write_unlocked_buffers
  0.00    550.42     0.00        8     0.00   793.75  sync_buffers
  0.00    550.42    0.00        8     0.00   100.26  wait_for_locked_buffers
  0.00    550.42     0.00        4     0.00   793.75  fsync_dev
  0.00    550.42     0.00        2     0.00  8767.16  blkdev_put
  0.00    550.42     0.00        2     0.00  1587.54  sys_fsync

[1]     29.6  158.93    0.00                 cg_record_arc [1]

[2]     28.3    0.03  152.01                 ia64_ret_from_syscall [2]
                0.09  127.90  231867/231867      sys_write [3]
                0.03    7.21   66165/66165       sys_read [37]
                0.07    6.84  218393/218393      sys_lseek [38]

                0.09  127.90  231867/231867      ia64_ret_from_syscall [2]
[3]     23.9    0.09  127.90  231867         sys_write [3]
                1.33  118.33  216677/216677      block_write [4]

                1.33  118.33  216677/216677      sys_write [3]
[4]     22.3    1.33  118.33  216677         block_write [4]
               81.58   22.81 6722853/6723015     getblk [5]
                0.64   10.42 6722853/6723043     mark_buffer_dirty [31]

                0.00    0.00       8/6723015     block_read [414]
                0.00    0.00     154/6723015     bread [320]
               81.58   22.81 6722853/6723015     block_write [4]
[5]     19.5   81.58   22.81 6723015         getblk [5]
               18.08    0.00 13509844/59278971     spin_lock_ [13]
                0.00    1.17   32955/32955       refill_freelist [76]
                0.87    0.00 6720919/6720919     _insert_into_queues [84]
                0.76    0.00 6720919/6720919     _remove_from_free_list [89]
                0.58    0.00 6755970/14043341     write_unlock_ [75]
                0.47    0.00 6755970/14043341     write_lock_ [83]
                0.44    0.00 6720919/26887713     _insert_into_lru_list [64]
                0.25    0.00 13509844/59693888     spin_unlock_ [78]
                0.18    0.00 6720919/6720919     init_buffer [142]

                0.01    5.52   12374/210102      write_unlocked_buffers [42]
                0.20   88.19  197728/210102      sync_old_buffers [11]
[6]     17.5    0.21   93.71  210102         write_some_buffers [6]
                0.04   92.58  210092/210092      write_locked_buffers [7]
                0.36    0.28 6722899/22499809     _refile_buffer [59]
                0.44    0.00 6722883/26887713     _insert_into_lru_list [64]

                0.04   92.58  210092/210092      write_some_buffers [6]
[7]     17.3    0.04   92.58  210092         write_locked_buffers [7]
                0.30   92.28 6722883/6722884     submit_bh [8]

                0.00    0.00       1/6722884     ll_rw_block [507]
                0.30   92.28 6722883/6722884     write_locked_buffers [7]
[8]     17.3    0.30   92.28 6722884         submit_bh [8]
                0.25   92.02 6722884/6722884     generic_make_request [9]

                0.25   92.02 6722884/6722884     submit_bh [8]
[9]     17.2    0.25   92.02 6722884         generic_make_request [9]
               28.80   27.38 20168512/20168512     blk_get_queue [14]
               25.34    9.16 6722884/6722884     _make_request [15]
                0.33    1.01 13445628/13445628     md_make_request [70]

[10]    16.7    0.00   89.44                 kupdate [10]
                0.05   88.66      47/47          sync_old_buffers [11]
                0.11    0.62      47/98          wait_for_buffers [65]

                0.05   88.66      47/47          kupdate [10]
[11]    16.5    0.05   88.66      47         sync_old_buffers [11]
                0.20   88.19  197728/210102      write_some_buffers [6]
                0.26    0.00  197821/59278971     spin_lock_ [13]
                0.00    0.00      47/47          sync_unlocked_inodes [295]
                0.00    0.00      47/51          sync_supers [424]
                0.00    0.00     187/59693888     spin_unlock_ [78]

[12]    16.3   87.44    0.09                 cpu_idle [12]
                0.06    0.03   10282/31233       schedule [129]

                0.15    0.00  109352/59278971     ahc_linux_isr [50]
                0.15    0.00  113900/59278971     do_IRQ [33]
                0.18    0.00  136509/59278971     real_lookup [63]
                0.18    0.00  136518/59278971     d_alloc [132]
                0.18    0.00  136518/59278971     d_rehash [141]
                0.18    0.00  136521/59278971     d_instantiate [140]
                0.18    0.00  136521/59278971     get_empty_inode [133]
                0.19    0.00  139538/59278971     _wake_up [107]
                0.23    0.00  173973/59278971     try_to_free_buffers [91]
                0.26    0.00  197821/59278971     sync_old_buffers [11]
                0.29    0.00  218393/59278971     sys_lseek [38]
                0.47    0.00  353459/59278971     d_lookup [101]
                0.47    0.00  354202/59278971     timer_bh [99]
                0.71    0.00  527288/59278971     get_unused_buffer_head [87]
                0.93    0.00  696032/59278971     atomic_dec_and_lock [82]
                8.45    0.00 6310839/59278971     put_last_free [36]
                9.00    0.00 6722880/59278971     refile_buffer [35]
                9.00    0.00 6723070/59278971     _make_request [15]
               18.08    0.00 13509844/59278971     getblk [5]
               27.00    0.00 20168512/59278971     blk_get_queue [14]
[13]    14.8   79.35    0.00 59278971         spin_lock_ [13]

               28.80   27.38 20168512/20168512     generic_make_request [9]
[14]    10.5   28.80   27.38 20168512         blk_get_queue [14]
               27.00    0.00 20168512/59278971     spin_lock_ [13]
                0.38    0.00 20168512/59693888     spin_unlock_ [78]

Lockstat:
---------
System: Linux vanmaren.slc.unisys.com 2.4.8 #23 SMP Fri Aug 24 22:10:04 UTC 2001 ia64
Total counts

All (4) CPUs
Selecting locks that meet ALL of the following:
        utilization :   >  1.00%

Start time: Fri Aug 24 22:43:51 2001
End   time: Fri Aug 24 22:48:34 2001
Delta Time: 283.01 sec.
Hash table slots in use:      297.
Global read lock slots in use: 214.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN
RJECT  NAME

        1.1%  4.6us( 582ms)   47us(5781ms)( 2.7%)  59050695 98.9%  1.1%
0.02%  *TOTAL*

  1.0%  5.1%   18us( 111us)   35us( 576us)(0.03%)    163560 94.9%  5.1%
0%  allocator_request_lock

  1.4% 0.01%  0.3us( 7.2us)   15us(  41us)(0.00%)  13108126  100% 0.01%
0%  free_list+0x18

  3.8%  2.3%   27us(7193us)    0us                   400694 97.7%    0%
2.3%  global_bh_lock
  3.8%  2.3%   27us(7193us)    0us                   400694 97.7%    0%
2.3%    bh_action+0x60

  5.5% 0.78%  0.6us(  96us)  3.4us(  95us)(0.06%)  27316523 99.2% 0.78%
0%  io_request_lock
  3.6% 0.27%  1.5us(  42us)  3.9us(  45us)(0.01%)   6723632 99.7% 0.27%
0%    __make_request+0x290

 82.4%  3.1%   17us( 582ms)   32us( 582ms)( 1.2%)  13719319 96.9%  3.1%
0%  lru_list_lock
 0.04%    0%   59ms( 117ms)    0us                        2  100%    0%
0%    __invalidate_buffers+0x60
 78.3% 0.26%   33us(  97us)   89us( 296ms)(0.14%)   6746687 99.7% 0.26%
0%    getblk+0x60
  1.5% 18.2%   99ms( 582ms)  5.4us(  29us)(0.00%)        44 81.8% 18.2%
0%    kupdate+0x1b0
 0.01%    0%  603us(  15ms)    0us                       48  100%    0%
0%    wait_for_locked_buffers+0x60

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL
NOWAIT SPIN  NAME

       0.00%                               1.2us( 1.2us)(0.00%)    766789
100% 0.00%  *TOTAL*

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW)
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL
NOWAIT SPIN(  WW )  NAME

          0%   16us(  97us)    0us                   0us          13859543
100%    0%(   0%)  *TOTAL*

 79.8%    0%   17us(  97us)    0us                   0us          13125889
100%    0%(   0%)  hash_table_lock
  2.6%    0%  1.2us( 5.4us)    0us                   0us           6339799
100%    0%(   0%)    __invalidate_buffers+0x270
 76.9%    0%   32us(  97us)    0us                   0us           6746687
100%    0%(   0%)    getblk+0x80

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-28 17:35 The cause of the "VM" performance problem with 2.4.X Van Maren, Kevin
@ 2001-08-28 18:52 ` Linus Torvalds
  2001-08-28 19:29   ` André Dahlqvist
  2001-08-29  8:22   ` Jens Axboe
  0 siblings, 2 replies; 14+ messages in thread
From: Linus Torvalds @ 2001-08-28 18:52 UTC (permalink / raw)
  To: kevin.vanmaren, linux-kernel

In article <245F259ABD41D511A07000D0B71C4CBA289F5F@us-slc-exch-3.slc.unisys.com> you write:
>
>Okay, here are some more results with Morton's "2nd" patch,
>running 2.4.8, IA64-010815 on 4X IA64 (Lion) with 8GB RAM,
>733MHz B3 (4MB) CPUs.  3 Adaptec 39160 channels, each with
>8 (18GB) Cheetah drives, create 4 md's using 2 disks from
>each channel.  Also using Gibbs' 6.2.0 adaptec driver.
>I tried to extract the "important" info to keep this message
>relatively small.  Meant to get this out earlier...

Interesting.

>It looks like "getblk" could use some additional optimization.
>More info to follow, but even doing one "mkfs" at a time,
>the lru_list_lock is held over 82% of the time, mostly by
>getblk().

Note that the real fix is to make the block devices use the page cache.
Andrea has patches, but it's a 2.5.x thing.

The buffer cache is really meant to be used only for meta-data, and
"getblk()" should not show up on profiles under any reasonable load. I
seriously doubt that running parallel (or even a single) "mkfs" is a
reasonable load under normal use.

So I don't consider the buffer cache locking to be a critical
performance issue - but I _do_ want to make sure that we don't have any
_really_ horrid special cases where we just perform so abysmally that
it's not worth using Linux at all.

So small inefficiencies and lack of parallelism are not a huge issue for
the buffer cache - at least not until somebody finds a more
"interesting" load per se. 

As to the really bad behaviour with the CPU spending all the time just
traversing the dirty buffer lists, I'd love to hear what 2.4.10-pre2
does.  It doesn't use Andrew's advanced patch, but instead just makes
dirty buffer balancing not care about the device - which should allow
much better IO behaviour and not have any silly bad cases from a VM
standpoint. 


>Abbreviated/stripped kernprof/gprof output:
>------------------------------------------
>
>Each sample counts as 0.000976562 seconds.
>  %   cumulative   self              self     total
> time   seconds   seconds    calls  ms/call  ms/call  name
> 39.46    224.65   224.65                             cg_record_arc
> 16.40    318.00    93.34  6722992     0.01     0.02  getblk
>  9.02    369.33    51.33 50673121     0.00     0.00  spin_lock_
>  6.67    407.27    37.95  6722669     0.01     0.01  _make_request
>  4.51    432.97    25.70 13445261     0.00     0.00  blk_get_queue
>  2.61    447.83    14.86                             long_copy_user
>  2.59    462.56    14.72                             mcount
>  2.06    474.27    11.71                             cpu_idle

Now, while I don't worry about "getblk()" itself, the request stuff and
blk_get_queue() _can_ be quite an issue even under non-mkfs load, so
that certainly implies that we have some work to do. The area of the
block device request queueing is definitely something that 2.5.x will
start working on (Jens is already doing lots of stuff, obviously).

And your lock profile certainly shows the io_request_lock as a _major_
lock user, although I'm happy to see that contention seems to be
reasonably low. Still, I'd bet that it is worth working on..

		Linus


* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-28 18:52 ` Linus Torvalds
@ 2001-08-28 19:29   ` André Dahlqvist
  2001-08-29 13:49     ` Rik van Riel
  2001-08-29  8:22   ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: André Dahlqvist @ 2001-08-28 19:29 UTC (permalink / raw)
  To: linux-kernel

Linus Torvalds <torvalds@transmeta.com> wrote:

> Note that the real fix is to make the block devices use the page cache.
> Andrea has patches, but it's a 2.5.x thing.

Linus, could you please make a statement about when 2.5.x is likely to
start?  A lot of people are wondering about this, and we all know you
have some sort of plan laid out.  Tell us about it :-)
-- 

André Dahlqvist <andre.dahlqvist@telia.com>


* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-28 18:52 ` Linus Torvalds
  2001-08-28 19:29   ` André Dahlqvist
@ 2001-08-29  8:22   ` Jens Axboe
  2001-08-29  8:25     ` Jens Axboe
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Axboe @ 2001-08-29  8:22 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kevin.vanmaren, linux-kernel

On Tue, Aug 28 2001, Linus Torvalds wrote:
> >Abbreviated/stripped kernprof/gprof output:
> >------------------------------------------
> >
> >Each sample counts as 0.000976562 seconds.
> >  %   cumulative   self              self     total
> > time   seconds   seconds    calls  ms/call  ms/call  name
> > 39.46    224.65   224.65                             cg_record_arc
> > 16.40    318.00    93.34  6722992     0.01     0.02  getblk
> >  9.02    369.33    51.33 50673121     0.00     0.00  spin_lock_
> >  6.67    407.27    37.95  6722669     0.01     0.01  _make_request
> >  4.51    432.97    25.70 13445261     0.00     0.00  blk_get_queue
> >  2.61    447.83    14.86                             long_copy_user
> >  2.59    462.56    14.72                             mcount
> >  2.06    474.27    11.71                             cpu_idle
> 
> Now, while I don't worry about "getblk()" itself, the request stuff and
> blk_get_queue() _can_ be quite an issue even under non-mkfs load, so

blk_get_queue() is easy to 'fix', it grabs io_request_lock for no good
reason at all. I think this must have been a failed attempt to protect
switching of queues, however it's obviously very broken in this regard.
So in fact no skin is off our nose for just removing the io_request_lock
in that path. 2.5 will have it properly reference counted...

> And your lock profile certainly shows the io_request_lock as a _major_
> lock user, although I'm happy to see that contention seems to be
> reasonably low. Still, I'd bet that it is worth working on..

Sure is, the bio patches have not had io_request_lock in them for some
time.

-- 
Jens Axboe



* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-29  8:22   ` Jens Axboe
@ 2001-08-29  8:25     ` Jens Axboe
  0 siblings, 0 replies; 14+ messages in thread
From: Jens Axboe @ 2001-08-29  8:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: kevin.vanmaren, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

On Wed, Aug 29 2001, Jens Axboe wrote:
> > >  4.51    432.97    25.70 13445261     0.00     0.00  blk_get_queue
> > 
> > Now, while I don't worry about "getblk()" itself, the request stuff and
> > blk_get_queue() _can_ be quite an issue even under non-mkfs load, so
> 
> blk_get_queue() is easy to 'fix', it grabs io_request_lock for no good
> reason at all. I think this must have been a failed attempt to protect
> switching of queues, however it's obviously very broken in this regard.
> So in fact no skin is off our nose for just removing the io_request_lock
> in that path. 2.5 will have it properly reference counted...

Linus, will you take this patch to remove io_request_lock in this path?

-- 
Jens Axboe


[-- Attachment #2: blk-get-queue-lock-1 --]
[-- Type: text/plain, Size: 1734 bytes --]

--- drivers/block/ll_rw_blk.c~	Wed Aug 29 10:22:50 2001
+++ drivers/block/ll_rw_blk.c	Wed Aug 29 10:23:12 2001
@@ -131,7 +131,7 @@
 	return max_sectors[MAJOR(dev)][MINOR(dev)];
 }
 
-inline request_queue_t *__blk_get_queue(kdev_t dev)
+inline request_queue_t *blk_get_queue(kdev_t dev)
 {
 	struct blk_dev_struct *bdev = blk_dev + MAJOR(dev);
 
@@ -141,22 +141,6 @@
 		return &blk_dev[MAJOR(dev)].request_queue;
 }
 
-/*
- * NOTE: the device-specific queue() functions
- * have to be atomic!
- */
-request_queue_t *blk_get_queue(kdev_t dev)
-{
-	request_queue_t *ret;
-	unsigned long flags;
-
-	spin_lock_irqsave(&io_request_lock,flags);
-	ret = __blk_get_queue(dev);
-	spin_unlock_irqrestore(&io_request_lock,flags);
-
-	return ret;
-}
-
 static int __blk_cleanup_queue(struct list_head *head)
 {
 	struct request *rq;
@@ -1272,7 +1256,6 @@
 EXPORT_SYMBOL(end_that_request_last);
 EXPORT_SYMBOL(blk_init_queue);
 EXPORT_SYMBOL(blk_get_queue);
-EXPORT_SYMBOL(__blk_get_queue);
 EXPORT_SYMBOL(blk_cleanup_queue);
 EXPORT_SYMBOL(blk_queue_headactive);
 EXPORT_SYMBOL(blk_queue_make_request);
--- include/linux/blkdev.h~	Wed Aug 29 10:23:19 2001
+++ include/linux/blkdev.h	Wed Aug 29 10:23:29 2001
@@ -149,8 +149,7 @@
 extern void grok_partitions(struct gendisk *dev, int drive, unsigned minors, long size);
 extern void register_disk(struct gendisk *dev, kdev_t first, unsigned minors, struct block_device_operations *ops, long size);
 extern void generic_make_request(int rw, struct buffer_head * bh);
-extern request_queue_t *blk_get_queue(kdev_t dev);
-extern inline request_queue_t *__blk_get_queue(kdev_t dev);
+extern inline request_queue_t *blk_get_queue(kdev_t dev);
 extern void blkdev_release_request(struct request *);
 
 /*

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-28 19:29   ` André Dahlqvist
@ 2001-08-29 13:49     ` Rik van Riel
  0 siblings, 0 replies; 14+ messages in thread
From: Rik van Riel @ 2001-08-29 13:49 UTC (permalink / raw)
  To: André Dahlqvist; +Cc: linux-kernel

On Tue, 28 Aug 2001, André Dahlqvist wrote:
> Linus Torvalds <torvalds@transmeta.com> wrote:
>
> > Note that the real fix is to make the block devices use the page cache.
> > Andrea has patches, but it's a 2.5.x thing.
>
> Linus, could you please make a statement about when 2.5.x is likely to
> start?  A lot of people wonder this, and we all know you have some sort
> of plan laid out.  Tell us about it :-)

2.4.8-pre* seems to have been the turning point where
experimental stuff started getting introduced ;)


Rik
-- 
IA64: a worthy successor to i860.

http://www.surriel.com/		http://distro.conectiva.com/

Send all your spam to aardvark@nl.linux.org (spam digging piggy)


^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23 17:26 Van Maren, Kevin
  0 siblings, 0 replies; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23 17:26 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'

I did a "mkfs" on an "md" of 3 8-disk "md"s (a 24-disk md).  So no
lock contention, and only one "dev" to match, right?  [Trying to
avoid the problem.]

getblk was holding the lru_list_lock lock 86% of the time.

Was averaging about 25,000 blocks/sec to the disks (vmstat),
but it was often 200,000 over 2 seconds, then 5 seconds idle,
etc.  Disk lights didn't come on much (but they all blinked
nicely in sync).

I'll try your patch out and run individual disks again.

Kevin

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-23 17:06 Van Maren, Kevin
@ 2001-08-23 17:18 ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-23 17:18 UTC (permalink / raw)
  To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'

"Van Maren, Kevin" wrote:
> 
> It looks like the same patch as yesterday.  Did you attach the wrong
> patch?

Drat, sorry - it's on the other machine.  Here's a reconstruction:

--- linux-2.4.9/fs/buffer.c	Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c	Thu Aug 23 10:15:41 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct 
  * return without it!
  */
 #define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
 {
 	struct buffer_head *next;
 	struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
 	int nr;
 
 	next = lru_list[BUF_DIRTY];
+	if (start_bh && *start_bh) {
+		if ((*start_bh)->b_list == BUF_DIRTY)
+			next = *start_bh;
+		brelse(*start_bh);
+		*start_bh = NULL;
+	}
 	nr = nr_buffers_type[BUF_DIRTY] * 2;
 	count = 0;
 	while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
 
 		if (dev && bh->b_dev != dev)
 			continue;
-		if (test_and_set_bit(BH_Lock, &bh->b_state))
+		if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+			/* Shouldn't be on BUF_DIRTY */
+			__refile_buffer(bh);
 			continue;
+		}
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
 			if (count < NRSYNC)
 				continue;
 
+			if (start_bh && next) {
+				get_bh(next);
+				*start_bh = next;
+			}
 			spin_unlock(&lru_list_lock);
 			write_locked_buffers(array, count);
 			return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
  */
 static void write_unlocked_buffers(kdev_t dev)
 {
+	struct buffer_head *start_bh = NULL;
 	do {
 		spin_lock(&lru_list_lock);
-	} while (write_some_buffers(dev));
+	} while (write_some_buffers(dev, &start_bh));
+	brelse(start_bh);
 	run_task_queue(&tq_disk);
 }
 
@@ -1116,14 +1131,14 @@ int balance_dirty_state(kdev_t dev)
  */
 void balance_dirty(kdev_t dev)
 {
-	int state = balance_dirty_state(dev);
+	int state = balance_dirty_state(NODEV);
 
 	if (state < 0)
 		return;
 
 	/* If we're getting into imbalance, start write-out */
 	spin_lock(&lru_list_lock);
-	write_some_buffers(dev);
+	write_some_buffers(NODEV, NULL);
 
 	/*
 	 * And if we're _really_ out of balance, wait for
@@ -1132,8 +1147,9 @@ void balance_dirty(kdev_t dev)
 	 * This will throttle heavy writers.
 	 */
 	if (state > 0) {
-		wait_for_some_buffers(dev);
+		write_some_buffers(dev, NULL);
 		wakeup_bdflush();
+		wait_for_some_buffers(dev);
 	}
 }
 
@@ -2607,7 +2623,7 @@ static int sync_old_buffers(void)
 		bh = lru_list[BUF_DIRTY];
 		if (!bh || time_before(jiffies, bh->b_flushtime))
 			break;
-		if (write_some_buffers(NODEV))
+		if (write_some_buffers(NODEV, NULL))
 			continue;
 		return 0;
 	}
@@ -2706,7 +2722,7 @@ int bdflush(void *startup)
 		CHECK_EMERGENCY_SYNC
 
 		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+		if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
 			wait_for_some_buffers(NODEV);
 			interruptible_sleep_on(&bdflush_wait);
 		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23 17:06 Van Maren, Kevin
  2001-08-23 17:18 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23 17:06 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'

> > Yes, it helps quite a bit.  Still not as good as I'd like: I don't
> > dare try lots of disks yet :-(
> 
> Great, thanks.  Aside from the lock contention and stuff, how was the
> actual disk throughput affected?  You said that the stock kernel
> basically stops doing anything, yes?

It got really bad.  It would make "progress", but slower than I
can encode the bits on the platter with a hand magnet ;-)
Now, it's only "pretty bad".

> > Looks like blkdev_put() holds kernel_flag for way too long.
> 
> It calls fsync_dev().

Right.  That ends up calling our friend write_unlocked_buffers(),
which does its own locking, but it also holds the kernel lock the entire time.

> There are two n^2 loops in buffer.c.  There's one on the 
> sync_buffers()
> path, which we fixed yesterday.  But there's also a potential 
> O(2n) path
> in balance_dirty().  So if we're calling mark_buffer_dirty() a lot,
> this becomes quadratic.  For this to bite us the BUF_DIRTY list would
> have to be "almost full" of buffers for another device.  There's also
> some weird stuff in sync_buffers() - not sure what it's trying to do.
> I'll take that up with the boss when he gets back from the polar bear
> hunt or whatever it is they do over there.
> 
> Here's a different patch which addresses the balance_dirty() thing
> as well..

It looks like the same patch as yesterday.  Did you attach the wrong
patch?  In any event, it doesn't look like it helps balance_dirty()
at all because the location is not retained across multiple calls
to write_some_buffers().

Kevin Van Maren

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-23  1:48 Van Maren, Kevin
@ 2001-08-23 16:33 ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-23 16:33 UTC (permalink / raw)
  To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'

"Van Maren, Kevin" wrote:
> 
> > This (rather hastily tested) patch against 2.4.9 should give O(n)
> > behaviour in write_unlocked_buffers().  Does it help?
> 
> Yes, it helps quite a bit.  Still not as good as I'd like: I don't
> dare try lots of disks yet :-(

Great, thanks.  Aside from the lock contention and stuff, how was the
actual disk throughput affected?  You said that the stock kernel
basically stops doing anything, yes?
 
> Here is the new lockmeter output (8 parallel mkfs processes on 4X
> Lion to different disks, entire mkfs execution time).
> 
> Looks like blkdev_put() holds kernel_flag for way too long.
> 

It calls fsync_dev().

There are two n^2 loops in buffer.c.  There's one on the sync_buffers()
path, which we fixed yesterday.  But there's also a potential O(2n) path
in balance_dirty().  So if we're calling mark_buffer_dirty() a lot,
this becomes quadratic.  For this to bite us the BUF_DIRTY list would
have to be "almost full" of buffers for another device.  There's also
some weird stuff in sync_buffers() - not sure what it's trying to do.
I'll take that up with the boss when he gets back from the polar bear
hunt or whatever it is they do over there.

Here's a different patch which addresses the balance_dirty() thing
as well..


--- linux-2.4.9/fs/buffer.c	Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c	Wed Aug 22 13:16:22 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct 
  * return without it!
  */
 #define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
 {
 	struct buffer_head *next;
 	struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
 	int nr;
 
 	next = lru_list[BUF_DIRTY];
+	if (start_bh && *start_bh) {
+		if ((*start_bh)->b_list == BUF_DIRTY)
+			next = *start_bh;
+		brelse(*start_bh);
+		*start_bh = NULL;
+	}
 	nr = nr_buffers_type[BUF_DIRTY] * 2;
 	count = 0;
 	while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
 
 		if (dev && bh->b_dev != dev)
 			continue;
-		if (test_and_set_bit(BH_Lock, &bh->b_state))
+		if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+			/* Shouldn't be on BUF_DIRTY */
+			__refile_buffer(bh);
 			continue;
+		}
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
 			if (count < NRSYNC)
 				continue;
 
+			if (start_bh && next) {
+				get_bh(next);
+				*start_bh = next;
+			}
 			spin_unlock(&lru_list_lock);
 			write_locked_buffers(array, count);
 			return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
  */
 static void write_unlocked_buffers(kdev_t dev)
 {
+	struct buffer_head *start_bh = NULL;
 	do {
 		spin_lock(&lru_list_lock);
-	} while (write_some_buffers(dev));
+	} while (write_some_buffers(dev, &start_bh));
+	brelse(start_bh);
 	run_task_queue(&tq_disk);
 }
 
@@ -1117,13 +1132,15 @@ int balance_dirty_state(kdev_t dev)
 void balance_dirty(kdev_t dev)
 {
 	int state = balance_dirty_state(dev);
+	struct buffer_head *start_bh = NULL;
 
 	if (state < 0)
 		return;
 
 	/* If we're getting into imbalance, start write-out */
 	spin_lock(&lru_list_lock);
-	write_some_buffers(dev);
+	write_some_buffers(dev, &start_bh);
+	brelse(start_bh);
 
 	/*
 	 * And if we're _really_ out of balance, wait for
@@ -2607,7 +2624,7 @@ static int sync_old_buffers(void)
 		bh = lru_list[BUF_DIRTY];
 		if (!bh || time_before(jiffies, bh->b_flushtime))
 			break;
-		if (write_some_buffers(NODEV))
+		if (write_some_buffers(NODEV, NULL))
 			continue;
 		return 0;
 	}
@@ -2706,7 +2723,7 @@ int bdflush(void *startup)
 		CHECK_EMERGENCY_SYNC
 
 		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+		if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
 			wait_for_some_buffers(NODEV);
 			interruptible_sleep_on(&bdflush_wait);
 		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-23  1:48 Van Maren, Kevin
  2001-08-23 16:33 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-23  1:48 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'

> This (rather hastily tested) patch against 2.4.9 should give O(n)
> behaviour in write_unlocked_buffers().  Does it help?

Yes, it helps quite a bit.  Still not as good as I'd like: I don't
dare try lots of disks yet :-(

Here is the new lockmeter output (8 parallel mkfs processes on 4X
Lion to different disks, entire mkfs execution time).

Looks like blkdev_put() holds kernel_flag for way too long.

Looking under the lru_list_lock, you can see several functions that hold
the spinlock > 300ms.  getblk only holds it for a max of 7ms, but it gets
called so much it dominates the execution time.


___________________________________________________________________________________________
System: Linux vanmaren.slc.unisys.com 2.4.8 #13 SMP Wed Aug 22 18:34:15 UTC 2001 ia64
Total counts

All (4) CPUs
Selecting locks that meet ALL of the following:
        utilization :   >  1.00%


Start time: Wed Aug 22 19:10:17 2001
End   time: Wed Aug 22 19:11:30 2001
Delta Time: 72.67 sec.
Hash table slots in use:      254.
Global read lock slots in use: 101.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

       29.4%  4.8us(2809ms)   40us(1780ms)(62.4%)  15255259 70.6% 29.4% 0.00%  *TOTAL*

  3.9% 0.06%   29us(5958us)    0us                    96244  100%    0% 0.06%  global_bh_lock
  3.9% 0.06%   29us(5958us)    0us                    96244  100%    0% 0.06%    bh_action+0x40

  6.6%  9.1%  1.3us( 111us)   14us( 234us)( 1.7%)   3773504 90.9%  9.1%    0%  io_request_lock
 0.00% 10.8%  0.5us( 2.6us)   13us(  68us)(0.00%)       655 89.2% 10.8%    0%    __get_request_wait+0xb0
  5.4%  3.4%  2.1us(  77us)  9.8us( 234us)(0.21%)   1846091 96.6%  3.4%    0%    __make_request+0x270
 0.05% 18.6%  2.6us( 111us)  6.2us( 123us)(0.01%)     13228 81.4% 18.6%    0%    ahc_linux_isr+0x3d0
 0.77% 14.4%  0.3us( 3.8us)   16us( 226us)( 1.4%)   1845765 85.6% 14.4%    0%    blk_get_queue+0x50
 0.00% 18.6%  3.7us(  19us)   12us( 117us)(0.00%)       575 81.4% 18.6%    0%    generic_unplug_device+0x40
 0.29% 22.3%   11us(  92us)   11us( 154us)(0.02%)     18488 77.7% 22.3%    0%    scsi_dispatch_cmd+0x2b0
 0.01% 18.5%  0.4us( 3.3us)  5.8us( 132us)(0.01%)     16157 81.5% 18.5%    0%    scsi_finish_command+0x60
 0.05% 17.5%  2.1us(  95us)  6.3us(  95us)(0.01%)     16157 82.5% 17.5%    0%    scsi_queue_next_request+0x40
 0.07% 11.8%  2.9us(  37us)  3.4us( 107us)(0.00%)     16388 88.2% 11.8%    0%    scsi_request_fn+0x510

  8.0% 0.45%   54us(2809ms)   25us(2394us)(0.00%)    107442 99.5% 0.45%    0%  kernel_flag
 0.00%    0%  0.4us( 0.5us)    0us                        3  100%    0%    0%    acct_process+0x140
 0.00% 56.2%  5.1us(  12us)  559us(2394us)(0.00%)        16 43.8% 56.2%    0%    blkdev_open+0xb0
  7.6% 20.0%  553ms(2809ms)   46us(  59us)(0.00%)        10 80.0% 20.0%    0%    blkdev_put+0xd0
 0.00%    0%  2.3us( 3.3us)    0us                        8  100%    0%    0%    chrdev_open+0xc0
 0.00%    0%   52us(  63us)    0us                        3  100%    0%    0%    do_exit+0x1f0
 0.00%    0%  0.5us( 0.9us)    0us                       63  100%    0%    0%    ext2_discard_prealloc+0x50
 0.00% 14.9%  1.9us(  41us)   14us(  28us)(0.00%)        74 85.1% 14.9%    0%    ext2_get_block+0xb0
 0.00%    0%  7.0us( 8.2us)    0us                        6  100%    0%    0%    fsync_dev+0x80
 0.00% 0.40%  1.3us( 9.9us)   20us(  57us)(0.00%)      1260 99.6% 0.40%    0%    posix_lock_file+0x2d0
 0.19% 0.15%  4.9us( 383us)  8.2us(  87us)(0.00%)     27984 99.9% 0.15%    0%    real_lookup+0x200
 0.00% 27.3%  6.4us(  17us)   15us(  20us)(0.00%)        11 72.7% 27.3%    0%    schedule+0x8c0
 0.00%    0%  219us(1031us)    0us                       10  100%    0%    0%    sync_old_buffers+0x50
 0.00% 22.2%  5.4us(  11us)   30us(  34us)(0.00%)         9 77.8% 22.2%    0%    sys_ioctl+0x90
 0.03% 0.54%  0.3us( 176us)   14us( 580us)(0.00%)     73179 99.5% 0.54%    0%    sys_lseek+0xf0
 0.11% 0.48%   20us( 586us)   34us( 172us)(0.00%)      4165 99.5% 0.48%    0%    tty_write+0x4a0
 0.02% 0.16%   24us(  64us)  2.4us( 2.4us)(0.00%)       640 99.8% 0.16%    0%    vfs_readdir+0x1c0
 0.00%    0%   31us(  31us)    0us                        1  100%    0%    0%    vfs_statfs+0xc0

 78.3% 87.2%   12us(1780ms)   43us(1780ms)(60.6%)   4681456 12.8% 87.2%    0%  lru_list_lock
  4.8% 30.0%  350ms(1780ms)   61us(  77us)(0.00%)        10 70.0% 30.0%    0%    __invalidate_buffers+0x40
 50.8% 78.5%   15us(7412us)   34us( 366ms)(21.6%)   2382129 21.5% 78.5%    0%    getblk+0x40
 0.79% 66.7%   64ms( 366ms)   36us(  97us)(0.00%)         9 33.3% 66.7%    0%    kupdate+0x190
  6.8% 97.0%  2.2us(1204us)   34us( 366ms)(25.2%)   2241541  3.0% 97.0%    0%    refile_buffer+0x30
 0.80% 90.9%   15us(3212us)  401us(1780ms)( 4.9%)     38766  9.1% 90.9%    0%    sync_old_buffers+0x100
  3.4% 12.5%   77ms( 327ms)   22us(  54us)(0.00%)        32 87.5% 12.5%    0%    wait_for_locked_buffers+0x40
 11.0% 18.6%  420us( 344ms) 7344us(1780ms)( 8.9%)     18969 81.4% 18.6%    0%    write_unlocked_buffers+0x40

  1.3% 0.45%  0.4us(3859us)  2.0us( 180us)(0.01%)   2381262 99.6% 0.45%    0%  getblk+0x2b0

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL NOWAIT SPIN  NAME

       0.00%                               1.2us( 1.5us)(0.00%)    229513  100% 0.00%  *TOTAL*

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW)
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL NOWAIT SPIN(  WW )  NAME

 36.6%    0%  9.0us(3880us)    0us                   0us           2942425  100%    0%(   0%)  hash_table_lock
  1.1%    0%  1.4us( 6.5us)    0us                   0us            560296  100%    0%(   0%)    __invalidate_buffers+0x250
 35.5%    0%   11us(3880us)    0us                   0us           2382129  100%    0%(   0%)    getblk+0x60
_________________________________________________________________________________________________________________________
Number of read locks found=0

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: The cause of the "VM" performance problem with 2.4.X
@ 2001-08-22 22:23 Van Maren, Kevin
  0 siblings, 0 replies; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-22 22:23 UTC (permalink / raw)
  To: 'Andrew Morton'; +Cc: 'linux-kernel@vger.kernel.org'

Thanks Andrew!  I'll try out your patch shortly, and I'll let you
know how it goes.

> Note how fsync_dev() passes the target device to sync_buffers().  But
> the dirty buffer list is global.  So to write out the dirty buffers
> for a particular device, write_locked_buffers() has to do a linear
> walk of the dirty buffers for *other* devices to find the target
> device.

I had just figured that out.  I hadn't changed the code yet, though.

The other thing that exacerbates the start-from-0 approach is only
flushing 32 blocks at a time.  Especially if we take a long time to
find the first one because of the linear search, not doing much
work once we get there is also a problem.

> And write_unlocked_buffers() uses a quite common construct - it
> scans a list but when it drops the lock, it restarts the scan
> from the start of the list.  (We do this all over the kernel, and
> it keeps on biting us).

It sounds like we want a per-device dirty list, not a global (linear) one.
I'm sure there are other places in the kernel where a per-device list
would introduce problems, but it would make things like sync() better,
since we could flush all devices more in parallel.  There are probably
also problems there when scaling to 100's or 1000's of disks.

> So if the dirty buffer list has 10,000 buffers for device A and
> then 10,000 buffers for device B, and you call fsync_dev(B),
> we end up traversing the 10,000 buffers of device A 10,000/32 times,
> which is a lot.
> 
> In fact, write_unlocked_buffers(A) shoots itself in the foot by
> moving buffers for device A onto BUF_LOCKED, and then restarting the
> scan.  So of *course* we end up with zillions on non-A buffers at the
> head of the list.

Is it safe to restart from the middle of the list after dropping
the lock?  It looks like you try to solve that problem by only using
the hint if the block is still on the dirty list.

I was going to cache that per-device info globally as a quick hack;
I was thinking about hanging it off the kdev_t, since I didn't want
a fixed-size array in write_some_buffers(), but your approach of just
having the caller track it works too.


I'm off to try the patch...


But before I go, here is another profile run, but without your patch,
to validate your hypothesis (I seem to have lost several interrupts).

[ia64_spinlock_contention is time spinning on a spinlock (you know
which one); myskipbuffer1 is called each time it "skips" a buffer
in the LRU list (no dev match).
Note that in this case, 26000 elements are skipped on the "average"
call to write_some_buffers.]

Each sample counts as 0.000976562 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 50.36     40.43    40.43                             ia64_spinlock_contention
 23.98     59.68    19.25     2658     7.24     8.17  write_some_buffers
 17.10     73.40    13.72                             cg_record_arc
  4.48     77.00     3.59                             mcount
  1.36     78.09     1.10    85039     0.01     0.01  _make_request
  0.82     78.75     0.66 69385716     0.00     0.00  myskipbuffer1
  0.77     79.37     0.62    85039     0.01     0.01  blk_get_queue

index % time    self  children    called     name
                                                 <spontaneous>
[1]     52.7   40.43    0.00                 ia64_spinlock_contention [1]
-----------------------------------------------
                2.68    0.34     370/2658        sync_old_buffers [11]
               16.57    2.12    2288/2658        write_unlocked_buffers [6]
[2]     28.3   19.25    2.47    2658         write_some_buffers [2]
                0.00    1.77    2658/2658        write_locked_buffers [12]
                0.66    0.00 69385716/69385716     myskipbuffer1 [16]
                0.01    0.01   85056/123329      _refile_buffer [54]
                0.01    0.00   85056/130255      _insert_into_lru_list [70]
                0.01    0.00   85056/85056       myatomic_set_buffer_clean [93]
                0.00    0.00   85056/85056       my__refile_buffer [104]
                0.00    0.00   85056/85056       mytest_and_set_bit [113]
                0.00    0.00    2658/418442      spin_unlock_ [97]
-----------------------------------------------


Kevin Van Maren

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: The cause of the "VM" performance problem with 2.4.X
  2001-08-22  5:31 Van Maren, Kevin
@ 2001-08-22 20:19 ` Andrew Morton
  0 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2001-08-22 20:19 UTC (permalink / raw)
  To: Van Maren, Kevin; +Cc: 'linux-kernel@vger.kernel.org'

"Van Maren, Kevin" wrote:
> 
> ...
> 
> I've been running Linux on IA64 (4 CPU LION, 8GB RAM).  2.4.4+IA64 patches through
> 2.4.8+IA64 patches all exhibit "horrific" I/O behavior [disks are basically inactive,
> with occasional flickers, but the CPUs are pegged at 100% system time] when writing
> to multiple disks using multiple CPUs.  The easiest way for me to reproduce the
> problem is to run parallel "mkfs" processes (I use 4 SCSI disks).
> 
> First thing to do is to profile the kernel, to see why all 4 of my fast IA64
> processors are pegged at 99%+ in the kernel (and see what they are doing).
> So I get the kernel profile patches from SGI (http://oss.sgi.com/projects/kernprof/)
> and patch my kernel.  Profile 30 seconds during the "mkfs" process on 4 disks
> (plus a "sync" part way through for kicks).  Below is the "interesting" part
> of the output (truncated for brevity):

Note how fsync_dev() passes the target device to sync_buffers().  But
the dirty buffer list is global.  So to write out the dirty buffers
for a particular device, write_locked_buffers() has to do a linear
walk of the dirty buffers for *other* devices to find the target
device.

And write_unlocked_buffers() uses a quite common construct - it
scans a list but when it drops the lock, it restarts the scan
from the start of the list.  (We do this all over the kernel, and
it keeps on biting us).

So if the dirty buffer list has 10,000 buffers for device A and
then 10,000 buffers for device B, and you call fsync_dev(B),
we end up traversing the 10,000 buffers of device A 10,000/32 times,
which is a lot.

In fact, write_unlocked_buffers(A) shoots itself in the foot by
moving buffers for device A onto BUF_LOCKED, and then restarting the
scan.  So of *course* we end up with zillions of non-A buffers at the
head of the list.

fsync_dev() and balance_dirty() are the culprits in this scenario - I'd
be surprised if sys_sync() displayed similar quadratic behaviour.  (Well, it
would do so if there were a lot of locked buffers on BUF_DIRTY, but there
usually aren't).

This (rather hastily tested) patch against 2.4.9 should give O(n)
behaviour in write_unlocked_buffers().  Does it help?


--- linux-2.4.9/fs/buffer.c	Thu Aug 16 12:23:19 2001
+++ linux-akpm/fs/buffer.c	Wed Aug 22 13:16:22 2001
@@ -199,7 +199,7 @@ static void write_locked_buffers(struct 
  * return without it!
  */
 #define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, struct buffer_head **start_bh)
 {
 	struct buffer_head *next;
 	struct buffer_head *array[NRSYNC];
@@ -207,6 +207,12 @@ static int write_some_buffers(kdev_t dev
 	int nr;
 
 	next = lru_list[BUF_DIRTY];
+	if (start_bh && *start_bh) {
+		if ((*start_bh)->b_list == BUF_DIRTY)
+			next = *start_bh;
+		brelse(*start_bh);
+		*start_bh = NULL;
+	}
 	nr = nr_buffers_type[BUF_DIRTY] * 2;
 	count = 0;
 	while (next && --nr >= 0) {
@@ -215,8 +221,11 @@ static int write_some_buffers(kdev_t dev
 
 		if (dev && bh->b_dev != dev)
 			continue;
-		if (test_and_set_bit(BH_Lock, &bh->b_state))
+		if (test_and_set_bit(BH_Lock, &bh->b_state)) {
+			/* Shouldn't be on BUF_DIRTY */
+			__refile_buffer(bh);
 			continue;
+		}
 		if (atomic_set_buffer_clean(bh)) {
 			__refile_buffer(bh);
 			get_bh(bh);
@@ -224,6 +233,10 @@ static int write_some_buffers(kdev_t dev
 			if (count < NRSYNC)
 				continue;
 
+			if (start_bh && next) {
+				get_bh(next);
+				*start_bh = next;
+			}
 			spin_unlock(&lru_list_lock);
 			write_locked_buffers(array, count);
 			return -EAGAIN;
@@ -243,9 +256,11 @@ static int write_some_buffers(kdev_t dev
  */
 static void write_unlocked_buffers(kdev_t dev)
 {
+	struct buffer_head *start_bh = NULL;
 	do {
 		spin_lock(&lru_list_lock);
-	} while (write_some_buffers(dev));
+	} while (write_some_buffers(dev, &start_bh));
+	brelse(start_bh);
 	run_task_queue(&tq_disk);
 }
 
@@ -1117,13 +1132,15 @@ int balance_dirty_state(kdev_t dev)
 void balance_dirty(kdev_t dev)
 {
 	int state = balance_dirty_state(dev);
+	struct buffer_head *start_bh = NULL;
 
 	if (state < 0)
 		return;
 
 	/* If we're getting into imbalance, start write-out */
 	spin_lock(&lru_list_lock);
-	write_some_buffers(dev);
+	write_some_buffers(dev, &start_bh);
+	brelse(start_bh);
 
 	/*
 	 * And if we're _really_ out of balance, wait for
@@ -2607,7 +2624,7 @@ static int sync_old_buffers(void)
 		bh = lru_list[BUF_DIRTY];
 		if (!bh || time_before(jiffies, bh->b_flushtime))
 			break;
-		if (write_some_buffers(NODEV))
+		if (write_some_buffers(NODEV, NULL))
 			continue;
 		return 0;
 	}
@@ -2706,7 +2723,7 @@ int bdflush(void *startup)
 		CHECK_EMERGENCY_SYNC
 
 		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state(NODEV) < 0) {
+		if (!write_some_buffers(NODEV, NULL) || balance_dirty_state(NODEV) < 0) {
 			wait_for_some_buffers(NODEV);
 			interruptible_sleep_on(&bdflush_wait);
 		}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* The cause of the "VM" performance problem with 2.4.X
@ 2001-08-22  5:31 Van Maren, Kevin
  2001-08-22 20:19 ` Andrew Morton
  0 siblings, 1 reply; 14+ messages in thread
From: Van Maren, Kevin @ 2001-08-22  5:31 UTC (permalink / raw)
  To: 'linux-kernel@vger.kernel.org'

There has been quite a bit of discussion on VM-related problems on the list
(but I haven't been able to follow all of it); hopefully I can add some
clarity to the problem.  Unfortunately, I haven't fixed it yet (give me some
time... or fix it and send me a patch ;-)

I've been running Linux on IA64 (4 CPU LION, 8GB RAM).  2.4.4+IA64 patches
through 2.4.8+IA64 patches all exhibit "horrific" I/O behavior [disks are
basically inactive, with occasional flickers, but the CPUs are pegged at
100% system time] when writing to multiple disks using multiple CPUs.  The
easiest way for me to reproduce the problem is to run parallel "mkfs"
processes (I use 4 SCSI disks).

First thing to do is to profile the kernel, to see why all 4 of my fast IA64
processors are pegged at 99%+ in the kernel (and see what they are doing).
So I get the kernel profile patches from SGI
(http://oss.sgi.com/projects/kernprof/) and patch my kernel.  Profile for
30 seconds during the "mkfs" runs on 4 disks (plus a "sync" part way through
for kicks).  Below is the "interesting" part of the output (truncated for
brevity):

Flat profile:
  
 Each sample counts as 0.000976562 seconds.
   %   cumulative   self              self     total
  time   seconds   seconds    calls  ms/call  ms/call  name
  57.49     69.66    69.66        4 17414.79 25262.34  write_unlocked_buffers
  22.92     97.43    27.78    15115     1.84     2.39  write_some_buffers
   4.77    103.22     5.78        4  1445.31  2624.65  sync_old_buffers
   3.64    107.62     4.41   483617     0.01     0.01  _make_request
   3.30    111.62     4.00                             cg_record_arc
   2.92    115.17     3.54   483618     0.01     0.01  blk_get_queue
   1.62    117.13     1.96     2975     0.66     0.66  getblk
   1.55    119.00     1.88     2910     0.65     0.65  refile_buffer
 [. . .]
  
 index % time    self  children    called     name
                                                  <spontaneous>
 [1]     86.9    0.00  105.12                 ia64_ret_from_syscall [1]
                 0.00   75.79       3/3           sys_fsync [5]
                 0.00   25.26       1/1           sys_sync [7]
                 0.00    3.77    1195/1195        sys_write [15]
                 0.00    0.14    2306/2306        sys_read [38]
                 0.00    0.07    2088/2088        sys_open [43]
 [. . .]
 -----------------------------------------------
                 0.00   25.26       1/4           sys_sync [7]
                 0.00   75.79       3/4           sys_fsync [5]
 [2]     83.5    0.00  101.05       4         fsync_dev [2]
                 0.00  101.05       4/4           sync_buffers [3]
 -----------------------------------------------
                 0.00  101.05       4/4           fsync_dev [2]
 [3]     83.5    0.00  101.05       4         sync_buffers [3]
                69.66   31.39       4/4           write_unlocked_buffers [4]
 -----------------------------------------------
                69.66   31.39       4/4           sync_buffers [3]
 [4]     83.5   69.66   31.39       4         write_unlocked_buffers [4]
                24.18    7.21   13157/15115       write_some_buffers [6]
 -----------------------------------------------
                 0.00   75.79       3/3           ia64_ret_from_syscall [1]
 [5]     62.7    0.00   75.79       3         sys_fsync [5]
                 0.00   75.79       3/4           fsync_dev [2]
                 0.00    0.00       3/5429        fget [354]
                 0.00    0.00       3/38          filemap_fdatasync [479]
                 0.00    0.00       3/3           block_fsync [577]
 -----------------------------------------------
                 3.60    1.07    1958/15115       sync_old_buffers [9]
                24.18    7.21   13157/15115       write_unlocked_buffers [4]
 [6]     29.8   27.78    8.29   15115         write_some_buffers [6]
                 0.01    8.16   15115/15115       write_locked_buffers [10]
                 0.04    0.04  483680/567312      _refile_buffer [41]
                 0.04    0.00  483680/570200      _insert_into_lru_list [57]
 -----------------------------------------------
                 0.00   25.26       1/1           ia64_ret_from_syscall [1]
 [7]     20.9    0.00   25.26       1         sys_sync [7]
                 0.00   25.26       1/4           fsync_dev [2]
 -----------------------------------------------
                                                  <spontaneous>
 [8]      8.7    0.01   10.52                 kupdate [8]
                 5.78    4.72       4/4           sync_old_buffers [9]
                 0.00    0.02       3/3           wait_for_buffers [70]
                 0.00    0.00       3/115         schedule_timeout [123]
 -----------------------------------------------
                 5.78    4.72       4/4           kupdate [8]
 [9]      8.7    5.78    4.72       4         sync_old_buffers [9]
                 3.60    1.07    1958/15115       write_some_buffers [6]
                 0.00    0.05       4/4           sync_unlocked_inodes [53]
                 0.00    0.00       4/4           sync_supers [575]
 -----------------------------------------------
                 0.01    8.16   15115/15115       write_some_buffers [6]
 [10]     6.8    0.01    8.16   15115         write_locked_buffers [10]
                 0.14    8.02  483617/483618      submit_bh [11]
 -----------------------------------------------
                 0.00    0.00       1/483618      ll_rw_block [269]
                 0.14    8.02  483617/483618      write_locked_buffers [10]
 [11]     6.7    0.14    8.02  483618         submit_bh [11]
                 0.06    7.96  483618/483618      generic_make_request [12]
 -----------------------------------------------
                 0.06    7.96  483618/483618      submit_bh [11]
 [12]     6.6    0.06    7.96  483618         generic_make_request [12]
                 4.41    0.01  483617/483617      _make_request [13]
                 3.54    0.00  483618/483618      blk_get_queue [17]
 -----------------------------------------------
 

Okay, that was helpful.  But Yikes!  Why would write_unlocked_buffers
take so long?  It MUST be a locking problem: all it does is acquire a
lock and call write_some_buffers().  That's fine, I'll profile the locks
using SGI's handy-dandy lockmeter lock-profiling code
(http://oss.sgi.com/projects/lockmeter/).  Hack it up to apply to my
hacked 2.4.8 kernel, and away we go.  Run over 30 seconds of 4*mkfs.
Goes noticeably faster than when I was profiling the whole kernel...
Dumping all locks that have > 1% "utilization" gives:
___________________________________________________________________________________________
System: Linux vanmaren.slc.unisys.com 2.4.8 #5 SMP Tue Aug 21 21:27:10 UTC 2001 ia64
Total counts

All (4) CPUs
Selecting locks that meet ALL of the following:
        utilization :   >  1.00%


Start time: Tue Aug 21 22:05:45 2001
End   time: Tue Aug 21 22:06:16 2001
Delta Time: 30.19 sec.
Hash table slots in use:      228.
Global read lock slots in use: 71.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
SPINLOCKS         HOLD            WAIT
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )(% CPU)     TOTAL NOWAIT SPIN RJECT  NAME

       26.9%   24us(  49ms)  241us(  59ms)(71.4%)   1338023 73.1% 26.8% 0.09%  *TOTAL*

  2.4%  3.3%   19us(2035us)    0us                    37909 96.7%    0% 3.3%  global_bh_lock
  2.4%  3.3%   19us(2035us)    0us                    37909 96.7%    0% 3.3%    bh_action+0x40

  5.2% 32.1%  1.5us(  69us)  9.4us( 228us)( 2.7%)   1072409 67.9% 32.1% 0%  io_request_lock
 0.00% 59.9%  0.9us( 2.7us)   13us( 115us)(0.00%)       322 40.1% 59.9% 0%    __get_request_wait+0xb0
  4.0% 19.9%  2.3us(  69us)  4.9us( 196us)(0.42%)    525687 80.1% 19.9% 0%    __make_request+0x270
 0.03% 18.5%  2.3us(  10us)   12us( 105us)(0.01%)      3531 81.5% 18.5% 0%    ahc_linux_isr+0x3d0
 0.92% 44.9%  0.5us( 3.9us)   11us( 228us)( 2.2%)    525536 55.1% 44.9% 0%    blk_get_queue+0x50
 0.00% 16.5%  2.6us(  18us)  7.7us(  42us)(0.00%)       243 83.5% 16.5% 0%    generic_unplug_device+0x40
 0.20% 21.3%   14us(  43us)   13us( 111us)(0.01%)      4399 78.7% 21.3% 0%    scsi_dispatch_cmd+0x2b0
 0.01% 16.0%  0.5us( 3.5us)   14us( 166us)(0.01%)      4146 84.0% 16.0% 0%    scsi_finish_command+0x60
 0.03% 16.8%  1.8us(  44us)   13us( 104us)(0.01%)      4146 83.2% 16.8% 0%    scsi_queue_next_request+0x40
 0.04%  6.4%  2.9us(  26us)  4.6us(  72us)(0.00%)      4399 93.6%  6.4% 0%    scsi_request_fn+0x510

 97.0% 60.3% 1253us(  49ms) 5886us(  59ms)(68.7%)     23383 39.7% 60.3% 0%  lru_list_lock
 0.00%    0%  1.0us( 1.0us)    0us                        1  100%    0% 0%    buffer_insert_inode_queue+0x30
 0.15% 11.4%   13us(  22us) 4396us(  30ms)( 1.5%)      3503 88.6% 11.4% 0%    getblk+0x40
 0.33% 33.3%   33ms(  49ms)  131us( 131us)(0.00%)         3 66.7% 33.3% 0%    kupdate+0x190
 0.00% 14.6%  0.4us( 2.8us) 3865us(  30ms)( 1.6%)      3447 85.4% 14.6% 0%    refile_buffer+0x30
 0.05% 60.8%   15us(  26us) 1256us(8562us)(0.66%)      1037 39.2% 60.8% 0%    sync_old_buffers+0x100
 0.00%    0%  6.6us(  11us)    0us                        2  100%    0% 0%    try_to_free_buffers+0x90
 96.5% 81.6% 1893us(9616us) 6247us(  59ms)(65.0%)     15390 18.4% 81.6% 0%    write_unlocked_buffers+0x40

- - - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK READS   HOLD    MAX  RDR BUSY PERIOD      WAIT
  UTIL  CON    MEAN   RDRS   MEAN(  MAX )   MEAN(  MAX )( %CPU)     TOTAL NOWAIT SPIN  NAME

       0.01%                               1.2us( 1.2us)(0.00%)     14867 100% 0.01%  *TOTAL*

- - - - - - - - - - -  - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
RWLOCK WRITES     HOLD           WAIT (ALL)           WAIT (WW)
  UTIL  CON    MEAN(  MAX )   MEAN(  MAX )( %CPU)   MEAN(  MAX )     TOTAL NOWAIT SPIN(  WW )  NAME

          0%  0.8us(  39us)    0us                   0us             72355 100%    0%(   0%)  *TOTAL*
_________________________________________________________________________________________________________
Number of read locks found=0


Yep, lru_list_lock is the problem, and it is exacerbated by
write_unlocked_buffers holding it for SO long.

The code in question (2.4.8 that I was testing with) is:

  
 static void write_unlocked_buffers(kdev_t dev)
 {
         do {
                 spin_lock(&lru_list_lock);
         } while (write_some_buffers(dev));
         run_task_queue(&tq_disk);
 }
  
 /*
  * Write some buffers from the head of the dirty queue.
  *
  * This must be called with the LRU lock held, and will
  * return without it!
  */
 #define NRSYNC (32)
 static int write_some_buffers(kdev_t dev)
 {
         struct buffer_head *next;
         struct buffer_head *array[NRSYNC];
         unsigned int count;
         int nr;
  
         next = lru_list[BUF_DIRTY];
         nr = nr_buffers_type[BUF_DIRTY] * 2;
         count = 0;
         while (next && --nr >= 0) {
                 struct buffer_head * bh = next;
                 next = bh->b_next_free;
  
                 if (dev && bh->b_dev != dev)
                         continue;
                 if (test_and_set_bit(BH_Lock, &bh->b_state))
                         continue;
                 if (atomic_set_buffer_clean(bh)) {
                         __refile_buffer(bh);
                         get_bh(bh);
                         array[count++] = bh;
                         if (count < NRSYNC)
                                 continue;
  
                         spin_unlock(&lru_list_lock);
                         write_locked_buffers(array, count);
                         return -EAGAIN;
                 }
                 unlock_buffer(bh);
                 __refile_buffer(bh);
         }
         spin_unlock(&lru_list_lock);
  
         if (count)
                 write_locked_buffers(array, count);
         return 0;
 }


It looks like more investigation is in order, but it appears that the
lru_list_lock is held for too long, and that write_some_buffers() is a
good candidate for SMP improvement.  Tomorrow, I'll track down where the
time is going in write_some_buffers().


Kevin Van Maren



Thread overview: 14+ messages
2001-08-28 17:35 The cause of the "VM" performance problem with 2.4.X Van Maren, Kevin
2001-08-28 18:52 ` Linus Torvalds
2001-08-28 19:29   ` André Dahlqvist
2001-08-29 13:49     ` Rik van Riel
2001-08-29  8:22   ` Jens Axboe
2001-08-29  8:25     ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2001-08-23 17:26 Van Maren, Kevin
2001-08-23 17:06 Van Maren, Kevin
2001-08-23 17:18 ` Andrew Morton
2001-08-23  1:48 Van Maren, Kevin
2001-08-23 16:33 ` Andrew Morton
2001-08-22 22:23 Van Maren, Kevin
2001-08-22  5:31 Van Maren, Kevin
2001-08-22 20:19 ` Andrew Morton
