hi!

I've been trying to implement a new locking scheme for JBD
(Journaling Block Device). JBD is a well-known bottleneck
for some configurations and loads.

The main ideas of the locking design:

1) we do not lock the whole journal to get access to a buffer;
   we lock only the buffer. let's call this lock the 'bh lock'.
   in fact, this lock is a simple one-bit state in the
   bh->b_state field. there are primitives to operate on this
   lock: jbd_lock_bh(), jbd_unlock_bh() and jbd_bh_locked().
   any operation on jh must be protected by this lock.

2) each transaction has its own lock to protect its buffer lists.
   journal_file_buffer() and journal_unfile_buffer() use
   jh->j_transaction to find that lock. jh->j_transaction is
   protected by the bh lock. so, every time one tries to get write
   access to a buffer, the following locking is used:
   
   get_write_access(bh)
   {
     jbd_lock_bh(bh);
     /* decide what to do with buffer: wait, file it, etc */
     journal_file_buffer(jh, th, BJ_Metadata);
     {
       spin_lock(&th->t_list_lock);
       /* add buffer to transaction's list */
       spin_unlock(&th->t_list_lock);
     }       
     jbd_unlock_bh(bh);
   }

   while a transaction is in the T_RUNNING state, all processing goes
   through this lock order. invalidatepage(), releasepage() and
   dirty_data() also use this order. journal_commit_transaction()
   accesses buffers in another order:
   for_each_buffer_in_list(list) {
     jbd_lock_bh(bh);
     /* process it */
     jbd_unlock_bh(bh);
   }
    
   so, it looks like a lock ordering violation. but it isn't, because
   the buffer is owned by the committing transaction and must not be
   refiled by the running transaction. the only places where the two
   orders really conflict are flushing ordered data in
   journal_commit_transaction() against journal_releasepage(), and
   journal_commit_transaction() against journal_dirty_data().
   journal_commit_transaction() walks through the list of the
   transaction's data buffers, while journal_releasepage() first looks
   at the buffer (so takes the bh lock), then refiles it (so takes
   t_list_lock) => possible deadlock.
   at the moment I use the following scheme:

   lock(transaction->t_list_lock);
   for_each_buffer_in_list(bh) {
     get_bh(bh);
     /* put bh in special array */
   }
   unlock(transaction->t_list_lock);

   for_each_buffer_in_special_array(bh) {
     jbd_lock_bh(bh);
     jh = bh2jh(bh);
     if (buffer belongs to the same transaction AND
         buffer is on the same list) {
           /* process buffer */
     }
     jbd_unlock_bh(bh);
     put_bh(bh);
   }

3) the transaction's state and credits are protected by transaction->t_lock
   
4) revoke list protection
   as we may have one running transaction and one committing transaction
   at the same time, we simply need two revoke lists: one for the
   running transaction and one for the committing transaction.
   processes may modify the revoke list simultaneously, so we protect
   the current revoke list with journal->j_revoke_lock

5) every time journal_commit_transaction() starts to commit a transaction,
   journal->j_running_transaction is set to NULL, and several
   start_this_handle() callers may then try to allocate a new transaction.
   to make this SMP-safe, get_transaction() uses journal->j_lock.

6) to protect the list of committed transactions, JBD uses
   journal->j_checkpoint_lock

7) log_do_checkpoint() scans the list of transactions and the list of
   buffers to be flushed. it competes with journal_commit_transaction().
   once again, the access order here is incompatible. I use the scheme
   described in item 2.
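
To make the lock ordering from items 1 and 2 concrete, here is a minimal
userspace sketch. It is not the patch itself: it assumes C11 atomics in
place of the kernel bit operations and spinlocks, and every name in it
(struct buf, struct txn, BH_LOCK_BIT, MAX_BATCH, snapshot(),
process_batch(), demo_commit_walk()) is hypothetical. It shows the
one-bit buffer lock, the bh lock -> t_list_lock order used while filing,
and the two-phase walk that pins buffers under the list lock and only
then takes each bh lock and rechecks ownership.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BH_LOCK_BIT 1u  /* hypothetical: b_state bit used as the bh lock */
#define MAX_BATCH   16  /* hypothetical size of the "special array" */

struct buf {
    atomic_uint b_state;  /* holds the bh lock bit */
    atomic_int  b_count;  /* reference count, as with get_bh()/put_bh() */
    struct txn *b_txn;    /* owning transaction; protected by bh lock */
    int         b_jlist;  /* list the buffer is on; protected by bh lock */
    struct buf *b_next;   /* list link; protected by t_list_lock */
};

struct txn {
    atomic_flag t_list_lock;  /* stands in for the kernel spinlock */
    struct buf *t_list;
};

static void lock_bh(struct buf *b)
{
    /* Spin until this caller is the one that set the bit. */
    while (atomic_fetch_or(&b->b_state, BH_LOCK_BIT) & BH_LOCK_BIT)
        ;
}

static void unlock_bh(struct buf *b)
{
    unsigned old = atomic_fetch_and(&b->b_state, ~BH_LOCK_BIT);
    assert(old & BH_LOCK_BIT);  /* must have been locked */
    (void)old;
}

static void list_lock(struct txn *t)
{
    while (atomic_flag_test_and_set(&t->t_list_lock))
        ;
}

static void list_unlock(struct txn *t) { atomic_flag_clear(&t->t_list_lock); }

/* Running-transaction order: bh lock first, then t_list_lock. */
static void file_buffer(struct buf *b, struct txn *t, int jlist)
{
    lock_bh(b);
    list_lock(t);
    b->b_txn = t;
    b->b_jlist = jlist;
    b->b_next = t->t_list;
    t->t_list = b;
    list_unlock(t);
    unlock_bh(b);
}

/* Phase 1: pin buffers under t_list_lock only; no bh lock is taken. */
static int snapshot(struct txn *t, struct buf **batch)
{
    int n = 0;
    list_lock(t);
    for (struct buf *b = t->t_list; b && n < MAX_BATCH; b = b->b_next) {
        atomic_fetch_add(&b->b_count, 1);  /* like get_bh() */
        batch[n++] = b;
    }
    list_unlock(t);
    return n;
}

/* Phase 2: take each bh lock with no list lock held, then revalidate. */
static int process_batch(struct buf **batch, int n, struct txn *t, int jlist)
{
    int done = 0;
    for (int i = 0; i < n; i++) {
        lock_bh(batch[i]);
        if (batch[i]->b_txn == t && batch[i]->b_jlist == jlist)
            done++;  /* still ours: process the buffer here */
        unlock_bh(batch[i]);
        atomic_fetch_sub(&batch[i]->b_count, 1);  /* like put_bh() */
    }
    return done;
}

/* Two buffers are snapshotted; one changes owner before phase 2 runs. */
int demo_commit_walk(void)
{
    struct txn committing = { ATOMIC_FLAG_INIT, NULL };
    struct buf a = {0}, b = {0}, *batch[MAX_BATCH];
    file_buffer(&a, &committing, 1);
    file_buffer(&b, &committing, 1);
    int n = snapshot(&committing, batch);
    lock_bh(&b);
    b.b_txn = NULL;  /* simulated refile; list unlinking omitted */
    unlock_bh(&b);
    return process_batch(batch, n, &committing, 1);  /* only a counts */
}
```

Because no code path in the sketch holds t_list_lock while waiting for a
bh lock, the walk cannot deadlock against file_buffer(). The price is the
recheck in phase 2: a buffer that changed owner after the snapshot is
skipped rather than processed.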


The patch I'm sending has been tested for dozens of hours with
fsx-linux & bash-shared-mapping & make -j8 bzImage on a dual
pIII-1GHz with 512MB RAM. Preempt was off. The patch is against
2.5.68-mm1.

I'd like to thank Andrew Morton for his huge help.

with best regards, Alex

PS. would be happy to hear any comments/suggestions ;)