* [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
@ 2009-05-01  2:44 Frederic Weisbecker
  2009-05-01  2:44 ` [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed() Frederic Weisbecker
                   ` (8 more replies)
  0 siblings, 9 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

Hi,

This reiserfs patchset applies against the latest tip:core/kill-the-BKL.
It adds various explicit write lock releases around specific sleeping sections.

A performance test with dbench on UP, with 100 processes running for 100 seconds,
gives the following results:

Locking		  Throughput 

Bkl:              11.2587 MB/s
Write lock/Mutex: 12.5713 MB/s

So the new locking scheme is about 11% faster than the BKL-based one.

It's not possible to compare the two inside the kill-the-BKL tree itself, because
there the BKL is no longer a real BKL but a plain mutex.

Instead, you can apply the following equivalent patch against -rc3 to test it:
http://www.kernel.org/pub/linux/kernel/people/frederic/reiserfs-kill-the-bkl-full.patch

Of course it might eat your data, make your cows produce black milk, bring coffee
to your children at 3:00 am, leave the Teletubbies song stuck in your head for
seven years, and so on...

Frederic.

The following changes since commit a3a2b793d18bc068b79508e96eba33ae2326f759:
  Alessio Igor Bogani (1):
        remove the BKL: remove "BKL auto-drop" assumption from ext3_remount()

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git bkl

Frederic Weisbecker (6):
      kill-the-BKL/reiserfs: release write lock on fs_changed()
      kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
      kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut()
      kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
      kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
      kill-the-BKL/reiserfs: release the write lock on flush_commit_list()

 fs/reiserfs/bitmap.c        |    2 ++
 fs/reiserfs/fix_node.c      |    4 ++++
 fs/reiserfs/journal.c       |    9 +++++++--
 fs/reiserfs/stree.c         |    2 ++
 include/linux/reiserfs_fs.h |    8 +++++++-
 5 files changed, 22 insertions(+), 3 deletions(-)


* [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  6:31   ` Andi Kleen
  2009-05-01  2:44 ` [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() Frederic Weisbecker
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

fs_changed() is a macro used by reiserfs to check whether its tree has been
rebalanced. It was designed to detect parallel changes to the tree after
calling a sleeping function, which released the BKL.

fs_changed() also calls cond_resched(), so if rescheduling is needed, this is
the best place to do it, since we check whether the tree has changed right
afterwards (the BKL used to be released across schedule()).

Even though we no longer use the BKL, we still want to release the write lock
while we reschedule, so that other waiters can acquire it safely; the
__fs_changed() check that follows keeps this correct.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 include/linux/reiserfs_fs.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
index 6587b4e..397d281 100644
--- a/include/linux/reiserfs_fs.h
+++ b/include/linux/reiserfs_fs.h
@@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
 #define get_generation(s) atomic_read (&fs_generation(s))
 #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
 #define __fs_changed(gen,s) (gen != get_generation (s))
-#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
+#define fs_changed(gen,s)		\
+({					\
+	reiserfs_write_unlock(s);	\
+	cond_resched();			\
+	reiserfs_write_lock(s);		\
+	__fs_changed(gen, s);		\
+})
 
 /***************************************************************************/
 /*                  FIXATE NODES                                           */
-- 
1.6.2.3



* [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
  2009-05-01  2:44 ` [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed() Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  7:09   ` Ingo Molnar
  2009-05-01  2:44 ` [PATCH 3/6] kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() Frederic Weisbecker
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

When do_journal_end() copies data to the journal block buffers in memory,
it reschedules if needed between each block copied and dirtied.

We can also release the write lock at this rescheduling point, as the BKL
did implicitly.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 fs/reiserfs/journal.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 7976d7d..373d080 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -4232,7 +4232,9 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
 		next = cn->next;
 		free_cnode(sb, cn);
 		cn = next;
+		reiserfs_write_unlock(sb);
 		cond_resched();
+		reiserfs_write_lock(sb);
 	}
 
 	/* we are done  with both the c_bh and d_bh, but
-- 
1.6.2.3



* [PATCH 3/6] kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
  2009-05-01  2:44 ` [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed() Frederic Weisbecker
  2009-05-01  2:44 ` [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  2:44 ` [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors() Frederic Weisbecker
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

prepare_for_delete_or_cut() can process several types of items, including
indirect items, i.e. items which contain no file data but pointers to the
unformatted nodes holding the scattered data of a file.

In this case it has to zero out these pointers to the block numbers of the
unformatted nodes and free those blocks in the bitmap.

This can take some time, so rescheduling is performed between each block
processed. We can safely release the write lock while rescheduling, as the
BKL did, because the code checks just afterwards whether the item has moved
during the sleep.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 fs/reiserfs/stree.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/stree.c b/fs/reiserfs/stree.c
index 6bd99a9..6ddcecb 100644
--- a/fs/reiserfs/stree.c
+++ b/fs/reiserfs/stree.c
@@ -1026,7 +1026,9 @@ static char prepare_for_delete_or_cut(struct reiserfs_transaction_handle *th, st
 			reiserfs_free_block(th, inode, block, 1);
 		    }
 
+		    reiserfs_write_unlock(sb);
 		    cond_resched();
+		    reiserfs_write_lock(sb);
 
 		    if (item_moved (&s_ih, path))  {
 			need_re_search = 1;
-- 
1.6.2.3



* [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2009-05-01  2:44 ` [PATCH 3/6] kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  5:51   ` Ingo Molnar
  2009-05-01  2:44 ` [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() Frederic Weisbecker
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

get_neighbors() is used to read the left and/or right neighbor blocks of a
given one in order to balance the tree.

sb_bread() is used to read the buffers of these neighbor blocks, and it
might sleep while waiting for that I/O.

The BKL was released at this point, so we can also release the write lock
before calling sb_bread().

This is safe because if the filesystem is changed after this lock release,
the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED in the function
header comments) in order to repeat the neighbor search.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 fs/reiserfs/fix_node.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/fix_node.c b/fs/reiserfs/fix_node.c
index bf5f2cb..3a685e3 100644
--- a/fs/reiserfs/fix_node.c
+++ b/fs/reiserfs/fix_node.c
@@ -1971,7 +1971,9 @@ static int get_neighbors(struct tree_balance *tb, int h)
 		     tb->FL[h]) ? tb->lkey[h] : B_NR_ITEMS(tb->
 								       FL[h]);
 		son_number = B_N_CHILD_NUM(tb->FL[h], child_position);
+		reiserfs_write_unlock(sb);
 		bh = sb_bread(sb, son_number);
+		reiserfs_write_lock(sb);
 		if (!bh)
 			return IO_ERROR;
 		if (FILESYSTEM_CHANGED_TB(tb)) {
@@ -2009,7 +2011,9 @@ static int get_neighbors(struct tree_balance *tb, int h)
 		child_position =
 		    (bh == tb->FR[h]) ? tb->rkey[h] + 1 : 0;
 		son_number = B_N_CHILD_NUM(tb->FR[h], child_position);
+		reiserfs_write_unlock(sb);
 		bh = sb_bread(sb, son_number);
+		reiserfs_write_lock(sb);
 		if (!bh)
 			return IO_ERROR;
 		if (FILESYSTEM_CHANGED_TB(tb)) {
-- 
1.6.2.3



* [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (3 preceding siblings ...)
  2009-05-01  2:44 ` [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors() Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  5:47   ` Ingo Molnar
  2009-05-01  2:44 ` [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list() Frederic Weisbecker
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
helper might sleep.

When the BKL was used, it was released at this point, so we can relax the
write lock here too.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 fs/reiserfs/bitmap.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
index 1470334..6854957 100644
--- a/fs/reiserfs/bitmap.c
+++ b/fs/reiserfs/bitmap.c
@@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
 	else if (bitmap == 0)
 		block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1;
 
+	reiserfs_write_unlock(sb);
 	bh = sb_bread(sb, block);
+	reiserfs_write_lock(sb);
 	if (bh == NULL)
 		reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) "
 		                 "reading failed", __func__, block);
-- 
1.6.2.3



* [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (4 preceding siblings ...)
  2009-05-01  2:44 ` [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() Frederic Weisbecker
@ 2009-05-01  2:44 ` Frederic Weisbecker
  2009-05-01  5:42   ` Ingo Molnar
  2009-05-01  5:35 ` [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Ingo Molnar
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01  2:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Frederic Weisbecker, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
ll_rw_block() might sleep, and the BKL was released at this point, so we can
also relax the write lock here.

[ Impact: release the reiserfs write lock when it is not needed ]

Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Alexander Beregalov <a.beregalov@gmail.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 fs/reiserfs/journal.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 373d080..b1ebd5a 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
 		    SB_ONDISK_JOURNAL_SIZE(s);
 		tbh = journal_find_get_block(s, bn);
 		if (tbh) {
-			if (buffer_dirty(tbh))
-			    ll_rw_block(WRITE, 1, &tbh) ;
+			if (buffer_dirty(tbh)) {
+		            reiserfs_write_unlock(s);
+			    ll_rw_block(WRITE, 1, &tbh);
+			    reiserfs_write_lock(s);
+			}
 			put_bh(tbh) ;
 		}
 	}
-- 
1.6.2.3



* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (5 preceding siblings ...)
  2009-05-01  2:44 ` [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list() Frederic Weisbecker
@ 2009-05-01  5:35 ` Ingo Molnar
  2009-05-01 12:18   ` Thomas Meyer
  2009-05-01 19:59 ` Linus Torvalds
  2009-05-02  1:39 ` Frederic Weisbecker
  8 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01  5:35 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro, Linus Torvalds


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> Hi,
> 
> This reiserfs patchset applies against latest tip:core/kill-the-BKL
> It adds various explicit write lock releases on specific sleeping sections.
> 
> A performance test with dbench on UP with 100 processus during 100 seconds
> gives the following results:
> 
> Locking		  Throughput 
> 
> Bkl:              11.2587 MB/s
> Write lock/Mutex: 12.5713 MB/s
> 
> So the new locking scheme makes it 11% faster than with the bkl.

Wow, nice!

> It's not possible to compare it on the kill-the-BKL tree because the Bkl
> is not anymore a Bkl inside but a plain Mutex.
> 
> Instead, you can apply the following equivalent patch against -rc3 to test it:
> http://www.kernel.org/pub/linux/kernel/people/frederic/reiserfs-kill-the-bkl-full.patch
> 
> Of course it might eat your data, make you cows produce black milk, bring coffee
> to your children at 3:00 am, turn the teletubbies song in your mind for
> seven years long and so...
> 
> Frederic.
> 
> The following changes since commit a3a2b793d18bc068b79508e96eba33ae2326f759:
>   Alessio Igor Bogani (1):
>         remove the BKL: remove "BKL auto-drop" assumption from ext3_remount()
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing.git bkl
> 
> Frederic Weisbecker (6):
>       kill-the-BKL/reiserfs: release write lock on fs_changed()
>       kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
>       kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut()
>       kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
>       kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
>       kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
> 
>  fs/reiserfs/bitmap.c        |    2 ++
>  fs/reiserfs/fix_node.c      |    4 ++++
>  fs/reiserfs/journal.c       |    9 +++++++--
>  fs/reiserfs/stree.c         |    2 ++
>  include/linux/reiserfs_fs.h |    8 +++++++-
>  5 files changed, 22 insertions(+), 3 deletions(-)

I've pulled it and have also merged -rc4 into the kill-the-BKL tree, 
which can be picked up from here:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core/kill-the-BKL

So for comparative benchmarking, vanilla v2.6.30-rc4 (which has the 
BKL) can be compared against latest kill-the-BKL.

	Ingo


* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01  2:44 ` [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list() Frederic Weisbecker
@ 2009-05-01  5:42   ` Ingo Molnar
  2009-05-01 13:13     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01  5:42 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> ll_rw_block() might sleep, and the bkl was released at this point. Then
> we can also relax the write lock at this point.
> 
> [ Impact: release the reiserfs write lock when it is not needed ]
> 
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Chris Mason <chris.mason@oracle.com>
> Cc: Alexander Beregalov <a.beregalov@gmail.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  fs/reiserfs/journal.c |    7 +++++--
>  1 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 373d080..b1ebd5a 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
>  		    SB_ONDISK_JOURNAL_SIZE(s);
>  		tbh = journal_find_get_block(s, bn);
>  		if (tbh) {
> -			if (buffer_dirty(tbh))
> -			    ll_rw_block(WRITE, 1, &tbh) ;
> +			if (buffer_dirty(tbh)) {
> +		            reiserfs_write_unlock(s);
> +			    ll_rw_block(WRITE, 1, &tbh);
> +			    reiserfs_write_lock(s);
> +			}
>  			put_bh(tbh) ;
>  		}
>  	}

there's 7 other instances of ll_rw_block():

fs/reiserfs/journal.c-			spin_unlock(lock);
fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
fs/reiserfs/journal.c-			spin_lock(lock);
--
fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
--
fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
--
fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
fs/reiserfs/journal.c-	}
--
fs/reiserfs/journal.c-	}
fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
--
fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
fs/reiserfs/stree.c-		brelse(bh[j]);
--
fs/reiserfs/stree.c-						    reada_blocks, reada_count);
fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
--
fs/reiserfs/super.c-{
fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
fs/reiserfs/super.c-	reiserfs_write_unlock(s);

in particular, the second stree.c one and the super.c one already have a 
write-unlock right next to the ll_rw_block() call.

I think the stree.c unlock could be moved to before the 
ll_rw_block() call straight away.

The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
into a local variable, then unlock the write-lock, then call 
ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
global filesystem state that has to be read with the lock held.)

ll_rw_block() generally has a chance to block (especially on READ) - so 
the other places could be converted to drop the write-lock too. Most seem 
straightforward - some need the same local-variable treatment as super.c.
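
I.e. roughly this shape, as an untested sketch just to illustrate (the 
surrounding function and error handling are omitted):

	struct buffer_head *bh = SB_BUFFER_WITH_SB(s);	/* dereference under the write lock */

	reiserfs_write_unlock(s);
	ll_rw_block(READ, 1, &bh);	/* may sleep waiting for the read */
	reiserfs_write_lock(s);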

	Ingo


* Re: [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
  2009-05-01  2:44 ` [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() Frederic Weisbecker
@ 2009-05-01  5:47   ` Ingo Molnar
  2009-05-01 13:19     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01  5:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
> helper might sleep.
> 
> Then, when the bkl was used, it was released at this point. We can then
> relax the write lock too here.
> 
> [ Impact: release the reiserfs write lock when it is not needed ]
> 
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Chris Mason <chris.mason@oracle.com>
> Cc: Alexander Beregalov <a.beregalov@gmail.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  fs/reiserfs/bitmap.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
> index 1470334..6854957 100644
> --- a/fs/reiserfs/bitmap.c
> +++ b/fs/reiserfs/bitmap.c
> @@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
>  	else if (bitmap == 0)
>  		block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1;
>  
> +	reiserfs_write_unlock(sb);
>  	bh = sb_bread(sb, block);
> +	reiserfs_write_lock(sb);
>  	if (bh == NULL)
>  		reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) "
>  		                 "reading failed", __func__, block);

Note, there's a side-effect here: the access to sb->s_blocksize_bits is 
moved outside of the lock. Previously it was accessed under the BKL. 
On a mounted filesystem the block size is not supposed to change, 
so it's probably not an issue - but I wanted to mention it.

	Ingo


* Re: [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
  2009-05-01  2:44 ` [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors() Frederic Weisbecker
@ 2009-05-01  5:51   ` Ingo Molnar
  2009-05-01 13:25     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01  5:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> get_neighbors() is used to get the left and/or right blocks 
> against a given one in order to balance a tree.
> 
> sb_bread() is used to read the buffer of these neighors blocks and 
> while it waits for this operation, it might sleep.
> 
> The bkl was released at this point, and then we can also release 
> the write lock before calling sb_bread().
> 
> This is safe because if the filesystem is changed after this lock 
> release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED 
> in the function header comments) in order to repeat the neighbhor 
> research.
> 
> [ Impact: release the reiserfs write lock when it is not needed ]

This should also be safe because under the BKL we _already_ dropped 
the lock when sb_bread() blocked (which it really would in the 
normal case).

There's one special case to consider though: sb_bread() maps to 
__bread() which can return without sleeping if the bh is already 
uptodate. So if the filesystem _knows_ that the bh is already 
uptodate and holds a reference to it (this is a common pattern in 
filesystems), it can have a locking assumption based on that.

No such assumption seems to be present here though.

	Ingo


* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01  2:44 ` [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed() Frederic Weisbecker
@ 2009-05-01  6:31   ` Andi Kleen
  2009-05-01 13:28     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Andi Kleen @ 2009-05-01  6:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Chris Mason, Alexander Beregalov, Alessio Igor Bogani,
	Jonathan Corbet, Alexander Viro

Frederic Weisbecker <fweisbec@gmail.com> writes:
>
> diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> index 6587b4e..397d281 100644
> --- a/include/linux/reiserfs_fs.h
> +++ b/include/linux/reiserfs_fs.h
> @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
>  #define get_generation(s) atomic_read (&fs_generation(s))
>  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
>  #define __fs_changed(gen,s) (gen != get_generation (s))
> -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> +#define fs_changed(gen,s)		\
> +({					\
> +	reiserfs_write_unlock(s);	\
> +	cond_resched();			\
> +	reiserfs_write_lock(s);		\

Did you try writing that 

    if (need_resched()) {               \
	reiserfs_write_unlock(s);	\
	cond_resched();			\  (or schedule(), but cond_resched does a loop)
	reiserfs_write_lock(s);		\
    }				

? That might give better performance under load because users will be better
batched and you don't release the lock unnecessarily in the unloaded case.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
  2009-05-01  2:44 ` [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() Frederic Weisbecker
@ 2009-05-01  7:09   ` Ingo Molnar
  2009-05-01 13:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01  7:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> When do_journal_end() copies data to the journal blocks buffers in memory,
> it reschedules if needed between each block copied and dirtyfied.
> 
> We can also release the write lock at this rescheduling stage,
> like did the bkl implicitly.
> 
> [ Impact: release the reiserfs write lock when it is not needed ]
> 
> Cc: Jeff Mahoney <jeffm@suse.com>
> Cc: Chris Mason <chris.mason@oracle.com>
> Cc: Alexander Beregalov <a.beregalov@gmail.com>
> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> ---
>  fs/reiserfs/journal.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 7976d7d..373d080 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -4232,7 +4232,9 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
>  		next = cn->next;
>  		free_cnode(sb, cn);
>  		cn = next;
> +		reiserfs_write_unlock(sb);
>  		cond_resched();
> +		reiserfs_write_lock(sb);

There are 8 cond_resched()s in the code - I'd suggest introducing a 
helper for this - reiserfs_write_cond_resched(sb) or so?
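
Something like this, as an untested sketch of the shape:

	static inline void reiserfs_write_cond_resched(struct super_block *sb)
	{
		reiserfs_write_unlock(sb);
		cond_resched();
		reiserfs_write_lock(sb);
	}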

	Ingo


* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01  5:35 ` [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Ingo Molnar
@ 2009-05-01 12:18   ` Thomas Meyer
  2009-05-01 14:12     ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Thomas Meyer @ 2009-05-01 12:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, J

On Friday, 2009-05-01 at 07:35 +0200, Ingo Molnar wrote:
> 
> I've pulled it and have also merged -rc4 into the kill-the-BKL tree, 
> which can picked up from here:
> 
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core/kill-the-BKL
> 
> So for comparative benchmarking, vanilla v2.6.30-rc4 (which has the 
> BKL) can be compared against latest kill-the-BKL

Hi,

The kernel gets soft-locked-up somewhere in the initrd:

See
http://m3y3r.de/bilder/img_0524.jpg
http://m3y3r.de/bilder/img_0525.jpg
http://m3y3r.de/bilder/img_0526.jpg
http://m3y3r.de/bilder/img_0527.jpg
http://m3y3r.de/bilder/img_0528.jpg
http://m3y3r.de/bilder/img_0529.jpg

config is:
http://m3y3r.de/bilder/config-2.6.30-rc4-ktb

greets
thomas




* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01  5:42   ` Ingo Molnar
@ 2009-05-01 13:13     ` Frederic Weisbecker
  2009-05-01 13:23       ` Ingo Molnar
  2009-05-01 13:26       ` Chris Mason
  0 siblings, 2 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 07:42:47AM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> > ll_rw_block() might sleep, and the bkl was released at this point. Then
> > we can also relax the write lock at this point.
> > 
> > [ Impact: release the reiserfs write lock when it is not needed ]
> > 
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Cc: Chris Mason <chris.mason@oracle.com>
> > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > ---
> >  fs/reiserfs/journal.c |    7 +++++--
> >  1 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > index 373d080..b1ebd5a 100644
> > --- a/fs/reiserfs/journal.c
> > +++ b/fs/reiserfs/journal.c
> > @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
> >  		    SB_ONDISK_JOURNAL_SIZE(s);
> >  		tbh = journal_find_get_block(s, bn);
> >  		if (tbh) {
> > -			if (buffer_dirty(tbh))
> > -			    ll_rw_block(WRITE, 1, &tbh) ;
> > +			if (buffer_dirty(tbh)) {
> > +		            reiserfs_write_unlock(s);
> > +			    ll_rw_block(WRITE, 1, &tbh);
> > +			    reiserfs_write_lock(s);
> > +			}
> >  			put_bh(tbh) ;
> >  		}
> >  	}
> 
> there's 7 other instances of ll_rw_block():
> 
> fs/reiserfs/journal.c-			spin_unlock(lock);
> fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
> fs/reiserfs/journal.c-			spin_lock(lock);
> --
> fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
> fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
> fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
> --
> fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
> fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
> fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
> --
> fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
> fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
> fs/reiserfs/journal.c-	}
> --
> fs/reiserfs/journal.c-	}
> fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
> fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
> --
> fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
> fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
> fs/reiserfs/stree.c-		brelse(bh[j]);
> --
> fs/reiserfs/stree.c-						    reada_blocks, reada_count);
> fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
> fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
> --
> fs/reiserfs/super.c-{
> fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
> fs/reiserfs/super.c-	reiserfs_write_unlock(s);
> 
> in particular the second stree.c one and the super.c has a 
> write-unlock straight before the lock-drop.
> 
> I think the stree.c unlock could be moved to before the 
> ll_rw_block() call straight away.


Indeed.


> 
> The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
> into a local variable, then unlock the wite-lock, then call 
> ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
> global filesystem state that has to be read with the lock held.)


Indeed, &(SB_BUFFER_WITH_SB(s)) is a pointer to blocks that
reflect the state of the filesystem, but it was already not
safe in the old code.

ll_rw_block() may sleep, and so may wait_on_buffer(). And this
pointer could already have changed during these sleeps.

If we put it in a local variable, that protects against a change of the
pointer value, but not of its contents, just like in the old scheme.

And I guess this pointer is unlikely to change; it is about the superblock
and the bitmap...

But I'm not sure. I guess it's indeed better to put it in a local variable.

> ll_rw_block() generally always has a chance to block (especially on 
> READ) - so the other places could be converted to drop the 
> write-lock too. Most seem straightforward - some need similar 
> local-variable treatment as super.c.


Ok, thanks!


> 	Ingo



* Re: [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
  2009-05-01  5:47   ` Ingo Molnar
@ 2009-05-01 13:19     ` Frederic Weisbecker
  2009-05-01 13:30       ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 07:47:34AM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
> > helper might sleep.
> > 
> > Then, when the bkl was used, it was released at this point. We can then
> > relax the write lock too here.
> > 
> > [ Impact: release the reiserfs write lock when it is not needed ]
> > 
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Cc: Chris Mason <chris.mason@oracle.com>
> > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > ---
> >  fs/reiserfs/bitmap.c |    2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
> > index 1470334..6854957 100644
> > --- a/fs/reiserfs/bitmap.c
> > +++ b/fs/reiserfs/bitmap.c
> > @@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
> >  	else if (bitmap == 0)
> >  		block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1;
> >  
> > +	reiserfs_write_unlock(sb);
> >  	bh = sb_bread(sb, block);
> > +	reiserfs_write_lock(sb);
> >  	if (bh == NULL)
> >  		reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) "
> >  		                 "reading failed", __func__, block);
> 
> Note, there's a side-effect here: the access to sb->b_blocksize is 
> moved outside of the lock. Previously it was accessed via the BKL. 
> On a mounted filesystem sb->b_blocksize is not supposed to change, 
> so it's probably not an issue - but wanted to mention it.
> 
> 	Ingo


Indeed. Well, I guess it can't be changed dynamically.
This is something that is chosen with mkfs at filesystem creation time,
but I guess it can't be changed afterwards, at least not while the
filesystem is mounted. I hope...

Frederic.



* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01 13:13     ` Frederic Weisbecker
@ 2009-05-01 13:23       ` Ingo Molnar
  2009-05-01 13:26       ` Chris Mason
  1 sibling, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 13:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Fri, May 01, 2009 at 07:42:47AM +0200, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> > > ll_rw_block() might sleep, and the bkl was released at this point. Then
> > > we can also relax the write lock at this point.
> > > 
> > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > 
> > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > Cc: Chris Mason <chris.mason@oracle.com>
> > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > ---
> > >  fs/reiserfs/journal.c |    7 +++++--
> > >  1 files changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > > index 373d080..b1ebd5a 100644
> > > --- a/fs/reiserfs/journal.c
> > > +++ b/fs/reiserfs/journal.c
> > > @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
> > >  		    SB_ONDISK_JOURNAL_SIZE(s);
> > >  		tbh = journal_find_get_block(s, bn);
> > >  		if (tbh) {
> > > -			if (buffer_dirty(tbh))
> > > -			    ll_rw_block(WRITE, 1, &tbh) ;
> > > +			if (buffer_dirty(tbh)) {
> > > +		            reiserfs_write_unlock(s);
> > > +			    ll_rw_block(WRITE, 1, &tbh);
> > > +			    reiserfs_write_lock(s);
> > > +			}
> > >  			put_bh(tbh) ;
> > >  		}
> > >  	}
> > 
> > there's 7 other instances of ll_rw_block():
> > 
> > fs/reiserfs/journal.c-			spin_unlock(lock);
> > fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
> > fs/reiserfs/journal.c-			spin_lock(lock);
> > --
> > fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
> > fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
> > fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
> > --
> > fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
> > fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
> > fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
> > --
> > fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
> > fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
> > fs/reiserfs/journal.c-	}
> > --
> > fs/reiserfs/journal.c-	}
> > fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
> > fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
> > --
> > fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
> > fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
> > fs/reiserfs/stree.c-		brelse(bh[j]);
> > --
> > fs/reiserfs/stree.c-						    reada_blocks, reada_count);
> > fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
> > fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
> > --
> > fs/reiserfs/super.c-{
> > fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
> > fs/reiserfs/super.c-	reiserfs_write_unlock(s);
> > 
> > in particular the second stree.c one and the super.c has a 
> > write-unlock straight before the lock-drop.
> > 
> > I think the stree.c unlock could be moved to before the 
> > ll_rw_block() call straight away.
> 
> 
> Indeed.
> 
> 
> > 
> > The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
> > into a local variable, then unlock the wite-lock, then call 
> > ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
> > global filesystem state that has to be read with the lock held.)
> 
> 
> Indeed &(SB_BUFFER_WITH_SB(s)) is a pointer to blocks that
> reflect the state of the filesystem but it was already not
> safe on the old code.
> 
> ll_rw_block() may sleep, and wait_on_buffer() too. And this
> pointer could have changed already during these sleeps.

No, it was safe before. This was the prior code:

 fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));

The argument is passed to ll_rw_block _before_ the scheduling. So 
the dereference always happens with the BKL held, and is safe.

It is true that ll_rw_block() can sleep (and will most likely sleep 
for a READ command), but that's not the issue: the issue is the 
dereference mentioned above, which can now move out from under the lock. 
(It might still be safe in special circumstances - but it has to be 
reviewed carefully.)

	Ingo


* Re: [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
  2009-05-01  5:51   ` Ingo Molnar
@ 2009-05-01 13:25     ` Frederic Weisbecker
  2009-05-01 13:29       ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 07:51:35AM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > get_neighbors() is used to get the left and/or right blocks 
> > against a given one in order to balance a tree.
> > 
> > sb_bread() is used to read the buffer of these neighors blocks and 
> > while it waits for this operation, it might sleep.
> > 
> > The bkl was released at this point, and then we can also release 
> > the write lock before calling sb_bread().
> > 
> > This is safe because if the filesystem is changed after this lock 
> > release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED 
> > in the function header comments) in order to repeat the neighbhor 
> > research.
> > 
> > [ Impact: release the reiserfs write lock when it is not needed ]
> 
> This should also be safe because under the BKL we _already_ dropped 
> the lock when sb_bread() blocked (which it really would in the 
> normal case).
> 
> There's one special case to consider though: sb_read() maps to 
> __bread() which can return without sleeping if the bh is already 
> uptodate. So if the filesystem _knows_ that the bh is already 
> uptodate and holds a reference to it (this is common pattern in 
> filesystems), it can have a locking assumption on that.
> 
> No such assumption seems to be present here though.
> 
> 	Ingo


Yeah, fortunately it doesn't base its check on the state of the
buffer but on the number of tree rebalancings.

But I'll have to remember this pattern; it could be present elsewhere
in reiserfs.

Thanks.



* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01 13:13     ` Frederic Weisbecker
  2009-05-01 13:23       ` Ingo Molnar
@ 2009-05-01 13:26       ` Chris Mason
  2009-05-01 13:29         ` Ingo Molnar
  1 sibling, 1 reply; 39+ messages in thread
From: Chris Mason @ 2009-05-01 13:26 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, 2009-05-01 at 15:13 +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 07:42:47AM +0200, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> > > ll_rw_block() might sleep, and the bkl was released at this point. Then
> > > we can also relax the write lock at this point.
> > > 
> > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > 
> > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > Cc: Chris Mason <chris.mason@oracle.com>
> > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > ---
> > >  fs/reiserfs/journal.c |    7 +++++--
> > >  1 files changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > > index 373d080..b1ebd5a 100644
> > > --- a/fs/reiserfs/journal.c
> > > +++ b/fs/reiserfs/journal.c
> > > @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
> > >  		    SB_ONDISK_JOURNAL_SIZE(s);
> > >  		tbh = journal_find_get_block(s, bn);
> > >  		if (tbh) {
> > > -			if (buffer_dirty(tbh))
> > > -			    ll_rw_block(WRITE, 1, &tbh) ;
> > > +			if (buffer_dirty(tbh)) {
> > > +		            reiserfs_write_unlock(s);
> > > +			    ll_rw_block(WRITE, 1, &tbh);
> > > +			    reiserfs_write_lock(s);
> > > +			}
> > >  			put_bh(tbh) ;
> > >  		}
> > >  	}
> > 
> > there's 7 other instances of ll_rw_block():
> > 
> > fs/reiserfs/journal.c-			spin_unlock(lock);
> > fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
> > fs/reiserfs/journal.c-			spin_lock(lock);
> > --
> > fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
> > fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
> > fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
> > --
> > fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
> > fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
> > fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
> > --
> > fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
> > fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
> > fs/reiserfs/journal.c-	}
> > --
> > fs/reiserfs/journal.c-	}
> > fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
> > fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
> > --
> > fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
> > fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
> > fs/reiserfs/stree.c-		brelse(bh[j]);
> > --
> > fs/reiserfs/stree.c-						    reada_blocks, reada_count);
> > fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
> > fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
> > --
> > fs/reiserfs/super.c-{
> > fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
> > fs/reiserfs/super.c-	reiserfs_write_unlock(s);
> > 
> > in particular the second stree.c one and the super.c has a 
> > write-unlock straight before the lock-drop.
> > 
> > I think the stree.c unlock could be moved to before the 
> > ll_rw_block() call straight away.
> 
> 
> Indeed.
> 
> 
> > 
> > The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
> > into a local variable, then unlock the wite-lock, then call 
> > ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
> > global filesystem state that has to be read with the lock held.)
> 
> 
> Indeed &(SB_BUFFER_WITH_SB(s)) is a pointer to blocks that
> reflect the state of the filesystem but it was already not
> safe on the old code.
> 

SB_BUFFER_WITH_SB isn't going to change.  It gets set once during mount
and will return the same buffer from then on.

-chris




* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01  6:31   ` Andi Kleen
@ 2009-05-01 13:28     ` Frederic Weisbecker
  2009-05-01 13:44       ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Chris Mason, Alexander Beregalov, Alessio Igor Bogani,
	Jonathan Corbet, Alexander Viro

On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> Frederic Weisbecker <fweisbec@gmail.com> writes:
> >
> > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > index 6587b4e..397d281 100644
> > --- a/include/linux/reiserfs_fs.h
> > +++ b/include/linux/reiserfs_fs.h
> > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> >  #define get_generation(s) atomic_read (&fs_generation(s))
> >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > +#define fs_changed(gen,s)		\
> > +({					\
> > +	reiserfs_write_unlock(s);	\
> > +	cond_resched();			\
> > +	reiserfs_write_lock(s);		\
> 
> Did you try writing that 
> 
>     if (need_resched()) {               \
> 	reiserfs_write_unlock(s);	\
> 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> 	reiserfs_write_lock(s);		\
>     }				
> 
> ? That might give better performance under load because users will be better
> batched and you don't release the lock unnecessarily in the unloaded case.



Good catch!
And I guess this pattern matches most of the cond_resched() calls
all over the code (the only condition is that we must already hold
the write lock).

I will merge your idea with Ingo's and write a
reiserfs_cond_resched() helper which
factorizes this pattern.
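
Something like this, as a first untested sketch (name and exact form
still to be decided):

	static inline void reiserfs_cond_resched(struct super_block *sb)
	{
		if (need_resched()) {
			/* only drop the write lock when we will actually yield */
			reiserfs_write_unlock(sb);
			cond_resched();
			reiserfs_write_lock(sb);
		}
	}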

Thanks.


 
> -Andi
> 
> -- 
> ak@linux.intel.com -- Speaking for myself only.



* Re: [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
  2009-05-01 13:25     ` Frederic Weisbecker
@ 2009-05-01 13:29       ` Chris Mason
  2009-05-01 13:31         ` Ingo Molnar
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2009-05-01 13:29 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, 2009-05-01 at 15:25 +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 07:51:35AM +0200, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > get_neighbors() is used to get the left and/or right blocks 
> > > against a given one in order to balance a tree.
> > > 
> > > sb_bread() is used to read the buffer of these neighors blocks and 
> > > while it waits for this operation, it might sleep.
> > > 
> > > The bkl was released at this point, and then we can also release 
> > > the write lock before calling sb_bread().
> > > 
> > > This is safe because if the filesystem is changed after this lock 
> > > release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED 
> > > in the function header comments) in order to repeat the neighbhor 
> > > research.
> > > 
> > > [ Impact: release the reiserfs write lock when it is not needed ]
> > 
> > This should also be safe because under the BKL we _already_ dropped 
> > the lock when sb_bread() blocked (which it really would in the 
> > normal case).
> > 
> > There's one special case to consider though: sb_read() maps to 
> > __bread() which can return without sleeping if the bh is already 
> > uptodate. So if the filesystem _knows_ that the bh is already 
> > uptodate and holds a reference to it (this is common pattern in 
> > filesystems), it can have a locking assumption on that.
> > 

sb_bread calls __bread which calls __getblk which always calls
might_sleep() before returning.

So, the unlock isn't adding a schedule that wasn't there before.

-chris




* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01 13:26       ` Chris Mason
@ 2009-05-01 13:29         ` Ingo Molnar
  2009-05-01 13:54           ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 13:29 UTC (permalink / raw)
  To: Chris Mason
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


* Chris Mason <chris.mason@oracle.com> wrote:

> On Fri, 2009-05-01 at 15:13 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 07:42:47AM +0200, Ingo Molnar wrote:
> > > 
> > > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > 
> > > > flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> > > > ll_rw_block() might sleep, and the bkl was released at this point. Then
> > > > we can also relax the write lock at this point.
> > > > 
> > > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > > 
> > > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > > Cc: Chris Mason <chris.mason@oracle.com>
> > > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > > ---
> > > >  fs/reiserfs/journal.c |    7 +++++--
> > > >  1 files changed, 5 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > > > index 373d080..b1ebd5a 100644
> > > > --- a/fs/reiserfs/journal.c
> > > > +++ b/fs/reiserfs/journal.c
> > > > @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
> > > >  		    SB_ONDISK_JOURNAL_SIZE(s);
> > > >  		tbh = journal_find_get_block(s, bn);
> > > >  		if (tbh) {
> > > > -			if (buffer_dirty(tbh))
> > > > -			    ll_rw_block(WRITE, 1, &tbh) ;
> > > > +			if (buffer_dirty(tbh)) {
> > > > +		            reiserfs_write_unlock(s);
> > > > +			    ll_rw_block(WRITE, 1, &tbh);
> > > > +			    reiserfs_write_lock(s);
> > > > +			}
> > > >  			put_bh(tbh) ;
> > > >  		}
> > > >  	}
> > > 
> > > there's 7 other instances of ll_rw_block():
> > > 
> > > fs/reiserfs/journal.c-			spin_unlock(lock);
> > > fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
> > > fs/reiserfs/journal.c-			spin_lock(lock);
> > > --
> > > fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
> > > fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
> > > fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
> > > --
> > > fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
> > > fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
> > > fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
> > > --
> > > fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
> > > fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
> > > fs/reiserfs/journal.c-	}
> > > --
> > > fs/reiserfs/journal.c-	}
> > > fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
> > > fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
> > > --
> > > fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
> > > fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
> > > fs/reiserfs/stree.c-		brelse(bh[j]);
> > > --
> > > fs/reiserfs/stree.c-						    reada_blocks, reada_count);
> > > fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
> > > fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
> > > --
> > > fs/reiserfs/super.c-{
> > > fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
> > > fs/reiserfs/super.c-	reiserfs_write_unlock(s);
> > > 
> > > in particular the second stree.c one and the super.c has a 
> > > write-unlock straight before the lock-drop.
> > > 
> > > I think the stree.c unlock could be moved to before the 
> > > ll_rw_block() call straight away.
> > 
> > 
> > Indeed.
> > 
> > 
> > > 
> > > The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
> > > into a local variable, then unlock the write-lock, then call 
> > > ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
> > > global filesystem state that has to be read with the lock held.)
> > 
> > 
> > Indeed &(SB_BUFFER_WITH_SB(s)) is a pointer to blocks that
> > reflect the state of the filesystem but it was already not
> > safe on the old code.
> > 
> 
> SB_BUFFER_WITH_SB isn't going to change.  It gets set once during 
> mount and will return the same buffer from then on.

Are we holding a permanent reference to it after the first read? If 
yes then any bread done on an uptodate bh will return immediately 
without scheduling.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
  2009-05-01 13:19     ` Frederic Weisbecker
@ 2009-05-01 13:30       ` Chris Mason
  2009-05-01 13:51         ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2009-05-01 13:30 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, 2009-05-01 at 15:19 +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 07:47:34AM +0200, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
> > > helper might sleep.
> > > 
> > > Then, when the bkl was used, it was released at this point. We can then
> > > relax the write lock too here.
> > > 
> > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > 
> > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > Cc: Chris Mason <chris.mason@oracle.com>
> > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > ---
> > >  fs/reiserfs/bitmap.c |    2 ++
> > >  1 files changed, 2 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
> > > index 1470334..6854957 100644
> > > --- a/fs/reiserfs/bitmap.c
> > > +++ b/fs/reiserfs/bitmap.c
> > > @@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
> > >  	else if (bitmap == 0)
> > >  		block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1;
> > >  
> > > +	reiserfs_write_unlock(sb);
> > >  	bh = sb_bread(sb, block);
> > > +	reiserfs_write_lock(sb);
> > >  	if (bh == NULL)
> > >  		reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) "
> > >  		                 "reading failed", __func__, block);
> > 
> > Note, there's a side-effect here: the access to sb->b_blocksize is 
> > moved outside of the lock. Previously it was accessed via the BKL. 
> > On a mounted filesystem sb->b_blocksize is not supposed to change, 
> > so it's probably not an issue - but wanted to mention it.

> Indeed. Well I guess it can't be dynamically changed.
> This is something that can be chosen with mkfs on filesystem creation
> but I guess it can't be changed, at least not while the filesystem
> is mounted. I hope...

The blocksize should only be changing very early in the mount.  It is
basically:

set the block size to something
read the super block
read the block size from the super block
set the block size to the correct value
use it from then on.

This should all be done by the time we read bitmaps.
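
Roughly, as a sketch of that mount-time sequence (simplified, and not
the exact reiserfs_fill_super() code -- the variable names here are
only for illustration):

	/* provisional block size, just enough to read the super block */
	sb_set_blocksize(s, size);
	bh = sb_bread(s, offset / size);
	rs = (struct reiserfs_super_block *)bh->b_data;
	/* switch to the real block size recorded in the on-disk super block */
	sb_set_blocksize(s, sb_blocksize(rs));
	/* bitmaps are only read after this point */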

-chris



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors()
  2009-05-01 13:29       ` Chris Mason
@ 2009-05-01 13:31         ` Ingo Molnar
  0 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 13:31 UTC (permalink / raw)
  To: Chris Mason
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


* Chris Mason <chris.mason@oracle.com> wrote:

> On Fri, 2009-05-01 at 15:25 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 07:51:35AM +0200, Ingo Molnar wrote:
> > > 
> > > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > 
> > > > get_neighbors() is used to get the left and/or right blocks 
> > > > against a given one in order to balance a tree.
> > > > 
> > > > sb_bread() is used to read the buffer of these neighbor blocks and 
> > > > while it waits for this operation, it might sleep.
> > > > 
> > > > The bkl was released at this point, and then we can also release 
> > > > the write lock before calling sb_bread().
> > > > 
> > > > This is safe because if the filesystem is changed after this lock 
> > > > release, the function returns REPEAT_SEARCH (aka SCHEDULE_OCCURRED 
> > > > in the function header comments) in order to repeat the neighbor 
> > > > search.
> > > > 
> > > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > 
> > > This should also be safe because under the BKL we _already_ dropped 
> > > the lock when sb_bread() blocked (which it really would in the 
> > > normal case).
> > > 
> > > There's one special case to consider though: sb_bread() maps to 
> > > __bread() which can return without sleeping if the bh is already 
> > > uptodate. So if the filesystem _knows_ that the bh is already 
> > > uptodate and holds a reference to it (this is common pattern in 
> > > filesystems), it can have a locking assumption on that.
> > > 
> 
> sb_bread calls __bread which calls __getblk which always calls 
> might_sleep() before returning. So, the unlock isn't adding a 
> schedule that wasn't there before.

ah, and might_sleep() can schedule under CONFIG_PREEMPT_VOLUNTARY=y. 
Good point.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
  2009-05-01  7:09   ` Ingo Molnar
@ 2009-05-01 13:31     ` Frederic Weisbecker
  2009-05-01 22:20       ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 09:09:11AM +0200, Ingo Molnar wrote:
> 
> * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > When do_journal_end() copies data to the journal blocks buffers in memory,
> > it reschedules if needed between each block copied and dirtied.
> > 
> > We can also release the write lock at this rescheduling stage,
> > like did the bkl implicitly.
> > 
> > [ Impact: release the reiserfs write lock when it is not needed ]
> > 
> > Cc: Jeff Mahoney <jeffm@suse.com>
> > Cc: Chris Mason <chris.mason@oracle.com>
> > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > ---
> >  fs/reiserfs/journal.c |    2 ++
> >  1 files changed, 2 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > index 7976d7d..373d080 100644
> > --- a/fs/reiserfs/journal.c
> > +++ b/fs/reiserfs/journal.c
> > @@ -4232,7 +4232,9 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
> >  		next = cn->next;
> >  		free_cnode(sb, cn);
> >  		cn = next;
> > +		reiserfs_write_unlock(sb);
> >  		cond_resched();
> > +		reiserfs_write_lock(sb);
> 
> There's 8 cond_resched()'s in the code - i'd suggest to introduce a 
> helper for this - reiserfs_write_cond_resched(sb) or so?
> 
> 	Ingo

Indeed, that + the pattern suggested by Al.

Thanks.
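
For the record, a minimal sketch of such a helper -- Ingo's suggested
naming plus Andi's need_resched() check; the exact name and placement
are just assumptions here, not part of the posted series:

	/*
	 * Sketch only: drop the reiserfs write lock across a voluntary
	 * reschedule, and only bother when a reschedule is actually pending.
	 */
	static inline void reiserfs_write_cond_resched(struct super_block *sb)
	{
		if (need_resched()) {
			reiserfs_write_unlock(sb);
			cond_resched();
			reiserfs_write_lock(sb);
		}
	}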


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01 13:28     ` Frederic Weisbecker
@ 2009-05-01 13:44       ` Chris Mason
  2009-05-01 14:01         ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2009-05-01 13:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andi Kleen, Ingo Molnar, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

On Fri, 2009-05-01 at 15:28 +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> > Frederic Weisbecker <fweisbec@gmail.com> writes:
> > >
> > > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > > index 6587b4e..397d281 100644
> > > --- a/include/linux/reiserfs_fs.h
> > > +++ b/include/linux/reiserfs_fs.h
> > > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> > >  #define get_generation(s) atomic_read (&fs_generation(s))
> > >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> > >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > > +#define fs_changed(gen,s)		\
> > > +({					\
> > > +	reiserfs_write_unlock(s);	\
> > > +	cond_resched();			\
> > > +	reiserfs_write_lock(s);		\
> > 
> > Did you try writing that 
> > 
> >     if (need_resched()) {               \
> > 	reiserfs_write_unlock(s);	\
> > 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> > 	reiserfs_write_lock(s);		\
> >     }				
> > 
> > ? That might give better performance under load because users will be better
> > batched and you don't release the lock unnecessarily in the unloaded case.
> 
> 
> 
> Good catch!
> And I guess this pattern matches most of the cond_resched()
> all over the code (the only condition is that we must already hold
> the write lock).
> 
> I will merge your idea and Ingo's one, write a
> reiserfs_cond_resched() to have a helper which
> factorizes this pattern.

The pattern you'll find goes like this:

lock_kernel()
do some work
do something that might schedule
run fs_changed(), fixup as required.

In your setup it is translating to:

reiserfs_write_lock(s)
do some work
reiserfs_write_unlock(s)

do something that might schedule

reiserfs_write_lock(s)
if (need_resched()) {
    reiserfs_write_unlock(s)
    cond_resched()
    reiserfs_write_lock(s)
}

if (__fs_changed()) fixup as required

You'll also find that item_moved is similar to __fs_changed() but more
fine grained.

One easy optimization is to make an fs_changed_relock()

static inline int fs_changed_relock(int gen, struct super_block *s)
{
	cond_resched();
	reiserfs_write_lock(s);
	return __fs_changed(gen, s);
}
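
As a rough usage illustration (the call site below is made up, it is
not taken from the actual reiserfs code):

	int fs_gen = get_generation(sb);

	reiserfs_write_unlock(sb);
	bh = sb_bread(sb, block);	/* may sleep, lock already dropped */
	if (fs_changed_relock(fs_gen, sb)) {
		/* the tree changed while we slept: fix up / repeat the search */
	}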

Another cause of scheduling is going to be reiserfs_prepare_for_journal.
This function gets called before we modify a metadata buffer and it
waits for IO to finish.

Not sure if your patch series already found it, but if you change this:

int reiserfs_prepare_for_journal(struct super_block *sb,
                                 struct buffer_head *bh, int wait)
{
        PROC_INFO_INC(sb, journal.prepare);

        if (!trylock_buffer(bh)) {
                if (!wait)
                        return 0;
                lock_buffer(bh);
        }

Into:

	if (!trylock_buffer(bh)) {
		if (!wait)
			return 0;
		reiserfs_write_unlock(s);
		wait_on_buffer(bh);
		reiserfs_write_lock(s);
		lock_buffer(bh);
	}

You'll catch a big cause of waiting for the disk with the lock held.

-chris



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block()
  2009-05-01 13:30       ` Chris Mason
@ 2009-05-01 13:51         ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 13:51 UTC (permalink / raw)
  To: Chris Mason
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 09:30:51AM -0400, Chris Mason wrote:
> On Fri, 2009-05-01 at 15:19 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 07:47:34AM +0200, Ingo Molnar wrote:
> > > 
> > > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > 
> > > > reiserfs_read_bitmap_block() uses sb_bread() to read the bitmap block. This
> > > > helper might sleep.
> > > > 
> > > > Then, when the bkl was used, it was released at this point. We can then
> > > > relax the write lock too here.
> > > > 
> > > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > > 
> > > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > > Cc: Chris Mason <chris.mason@oracle.com>
> > > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > > ---
> > > >  fs/reiserfs/bitmap.c |    2 ++
> > > >  1 files changed, 2 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/fs/reiserfs/bitmap.c b/fs/reiserfs/bitmap.c
> > > > index 1470334..6854957 100644
> > > > --- a/fs/reiserfs/bitmap.c
> > > > +++ b/fs/reiserfs/bitmap.c
> > > > @@ -1249,7 +1249,9 @@ struct buffer_head *reiserfs_read_bitmap_block(struct super_block *sb,
> > > >  	else if (bitmap == 0)
> > > >  		block = (REISERFS_DISK_OFFSET_IN_BYTES >> sb->s_blocksize_bits) + 1;
> > > >  
> > > > +	reiserfs_write_unlock(sb);
> > > >  	bh = sb_bread(sb, block);
> > > > +	reiserfs_write_lock(sb);
> > > >  	if (bh == NULL)
> > > >  		reiserfs_warning(sb, "sh-2029: %s: bitmap block (#%u) "
> > > >  		                 "reading failed", __func__, block);
> > > 
> > > Note, there's a side-effect here: the access to sb->b_blocksize is 
> > > moved outside of the lock. Previously it was accessed via the BKL. 
> > > On a mounted filesystem sb->b_blocksize is not supposed to change, 
> > > so it's probably not an issue - but wanted to mention it.
> 
> > Indeed. Well I guess it can't be dynamically changed.
> > This is something that can be chosen with mkfs on filesystem creation
> > but I guess it can't be changed, at least not while the filesystem
> > is mounted. I hope...
> 
> The blocksize should only be changing very early in the mount.  It is
> basically:
> 
> set the block size to something
> read the super block
> read the block size from the super block
> set the block size to the correct value
> use it from then on.
> 
> This should all be done with by the time we read bitmaps.
> 
> -chris


Ok, thanks.

So I guess we can keep it as is without a local variable.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list()
  2009-05-01 13:29         ` Ingo Molnar
@ 2009-05-01 13:54           ` Chris Mason
  0 siblings, 0 replies; 39+ messages in thread
From: Chris Mason @ 2009-05-01 13:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

On Fri, 2009-05-01 at 15:29 +0200, Ingo Molnar wrote:
> * Chris Mason <chris.mason@oracle.com> wrote:
> 
> > On Fri, 2009-05-01 at 15:13 +0200, Frederic Weisbecker wrote:
> > > On Fri, May 01, 2009 at 07:42:47AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > > 
> > > > > flush_commit_list() uses ll_rw_block() to commit the pending log blocks.
> > > > > ll_rw_block() might sleep, and the bkl was released at this point. Then
> > > > > we can also relax the write lock at this point.
> > > > > 
> > > > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > > > 
> > > > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > > > Cc: Chris Mason <chris.mason@oracle.com>
> > > > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > > > ---
> > > > >  fs/reiserfs/journal.c |    7 +++++--
> > > > >  1 files changed, 5 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > > > > index 373d080..b1ebd5a 100644
> > > > > --- a/fs/reiserfs/journal.c
> > > > > +++ b/fs/reiserfs/journal.c
> > > > > @@ -1120,8 +1120,11 @@ static int flush_commit_list(struct super_block *s,
> > > > >  		    SB_ONDISK_JOURNAL_SIZE(s);
> > > > >  		tbh = journal_find_get_block(s, bn);
> > > > >  		if (tbh) {
> > > > > -			if (buffer_dirty(tbh))
> > > > > -			    ll_rw_block(WRITE, 1, &tbh) ;
> > > > > +			if (buffer_dirty(tbh)) {
> > > > > +		            reiserfs_write_unlock(s);
> > > > > +			    ll_rw_block(WRITE, 1, &tbh);
> > > > > +			    reiserfs_write_lock(s);
> > > > > +			}
> > > > >  			put_bh(tbh) ;
> > > > >  		}
> > > > >  	}
> > > > 
> > > > there's 7 other instances of ll_rw_block():
> > > > 
> > > > fs/reiserfs/journal.c-			spin_unlock(lock);
> > > > fs/reiserfs/journal.c:			ll_rw_block(WRITE, 1, &bh);
> > > > fs/reiserfs/journal.c-			spin_lock(lock);
> > > > --
> > > > fs/reiserfs/journal.c-		            reiserfs_write_unlock(s);
> > > > fs/reiserfs/journal.c:			    ll_rw_block(WRITE, 1, &tbh);
> > > > fs/reiserfs/journal.c-			    reiserfs_write_lock(s);
> > > > --
> > > > fs/reiserfs/journal.c-	/* read in the log blocks, memcpy to the corresponding real block */
> > > > fs/reiserfs/journal.c:	ll_rw_block(READ, get_desc_trans_len(desc), log_blocks);
> > > > fs/reiserfs/journal.c-	for (i = 0; i < get_desc_trans_len(desc); i++) {
> > > > --
> > > > fs/reiserfs/journal.c-		set_buffer_dirty(real_blocks[i]);
> > > > fs/reiserfs/journal.c:		ll_rw_block(SWRITE, 1, real_blocks + i);
> > > > fs/reiserfs/journal.c-	}
> > > > --
> > > > fs/reiserfs/journal.c-	}
> > > > fs/reiserfs/journal.c:	ll_rw_block(READ, j, bhlist);
> > > > fs/reiserfs/journal.c-	for (i = 1; i < j; i++)
> > > > --
> > > > fs/reiserfs/stree.c-		if (!buffer_uptodate(bh[j]))
> > > > fs/reiserfs/stree.c:			ll_rw_block(READA, 1, bh + j);
> > > > fs/reiserfs/stree.c-		brelse(bh[j]);
> > > > --
> > > > fs/reiserfs/stree.c-						    reada_blocks, reada_count);
> > > > fs/reiserfs/stree.c:			ll_rw_block(READ, 1, &bh);
> > > > fs/reiserfs/stree.c-			reiserfs_write_unlock(sb);
> > > > --
> > > > fs/reiserfs/super.c-{
> > > > fs/reiserfs/super.c:	ll_rw_block(READ, 1, &(SB_BUFFER_WITH_SB(s)));
> > > > fs/reiserfs/super.c-	reiserfs_write_unlock(s);
> > > > 
> > > > in particular the second stree.c one and the super.c has a 
> > > > write-unlock straight before the lock-drop.
> > > > 
> > > > I think the stree.c unlock could be moved to before the 
> > > > ll_rw_block() call straight away.
> > > 
> > > 
> > > Indeed.
> > > 
> > > 
> > > > 
> > > > The super.c one needs more care: first put &(SB_BUFFER_WITH_SB(s)) 
> > > > into a local variable, then unlock the write-lock, then call 
> > > > ll_rw_block(). (This is important because &(SB_BUFFER_WITH_SB(s)) is 
> > > > global filesystem state that has to be read with the lock held.)
> > > 
> > > 
> > > Indeed &(SB_BUFFER_WITH_SB(s)) is a pointer to blocks that
> > > reflect the state of the filesystem but it was already not
> > > safe on the old code.
> > > 
> > 
> > SB_BUFFER_WITH_SB isn't going to change.  It gets set once during 
> > mount and will return the same buffer from then on.
> 
> Are we holding a permanent reference to it after the first read? If 
> yes then any bread done on an uptodate bh will return immediately 
> without scheduling

Yes, it is pinned the whole time the FS is mounted.  It looks to me like
the reread_meta_blocks() function can go away.  The log replay should be
updating the same caches that the super block bh is in.

But, this isn't very performance critical stuff, it only happens in
reiserfs_fill_super().  I'd leave it.

-chris





^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01 13:44       ` Chris Mason
@ 2009-05-01 14:01         ` Frederic Weisbecker
  2009-05-01 14:14           ` Chris Mason
  0 siblings, 1 reply; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 14:01 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, Ingo Molnar, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

On Fri, May 01, 2009 at 09:44:16AM -0400, Chris Mason wrote:
> On Fri, 2009-05-01 at 15:28 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> > > Frederic Weisbecker <fweisbec@gmail.com> writes:
> > > >
> > > > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > > > index 6587b4e..397d281 100644
> > > > --- a/include/linux/reiserfs_fs.h
> > > > +++ b/include/linux/reiserfs_fs.h
> > > > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> > > >  #define get_generation(s) atomic_read (&fs_generation(s))
> > > >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> > > >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > > > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > > > +#define fs_changed(gen,s)		\
> > > > +({					\
> > > > +	reiserfs_write_unlock(s);	\
> > > > +	cond_resched();			\
> > > > +	reiserfs_write_lock(s);		\
> > > 
> > > Did you try writing that 
> > > 
> > >     if (need_resched()) {               \
> > > 	reiserfs_write_unlock(s);	\
> > > 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> > > 	reiserfs_write_lock(s);		\
> > >     }				
> > > 
> > > ? That might give better performance under load because users will be better
> > > batched and you don't release the lock unnecessarily in the unloaded case.
> > 
> > 
> > 
> > Good catch!
> > And I guess this pattern matches most of the cond_resched()
> > all over the code (the only condition is that we must already hold
> > the write lock).
> > 
> > I will merge your idea and Ingo's one, write a
> > reiserfs_cond_resched() to have a helper which
> > factorizes this pattern.
> 
> The pattern you'll find goes like this:
> 
> lock_kernel()
> do some work
> do something that might schedule
> run fs_changed(), fixup as required.
> 
> In your setup it is translating to:
> 
> reiserfs_write_lock(s)
> do some work
> reiserfs_write_unlock(s)
> 
> do something that might schedule
> 
> reiserfs_write_lock(s)
> if (need_resched()) {
>     reiserfs_write_unlock(s)
>     cond_resched()
>     reiserfs_write_lock(s)
> }
> 
> if (__fs_changed()) fixup as required
> 
> You'll also find that item_moved is similar to __fs_changed() but more
> fine grained.
> 
> One easy optimization is to make an fs_changed_relock()
> 
> static inline int fs_changed_relock(gen, s) {
> 	cond_resched();
> 	reiserfs_write_lock(s)
> 	return __fs_changed(gen, s)
> }



Nice idea!
Does it mean I can also replace the item_moved() calls with __fs_changed()?

They don't seem to work the same way, so I guess I should provide two different
helpers, depending on the check.


> 
> Another cause of scheduling is going to be reiserfs_prepare_for_journal.
> This function gets called before we modify a metadata buffer and it
> waits for IO to finish.
> 
> Not sure if your patch series already found it, but if you change this:
> 
> int reiserfs_prepare_for_journal(struct super_block *sb,
>                                  struct buffer_head *bh, int wait)
> {
>         PROC_INFO_INC(sb, journal.prepare);
> 
>         if (!trylock_buffer(bh)) {
>                 if (!wait)
>                         return 0;
>                 lock_buffer(bh);
>         }
> 
> Into:
> 
> 	if (!trylock_buffer(bh)) {
> 		if (!wait)
> 			return 0;
> 		reiserfs_write_unlock(s);
> 		wait_on_buffer(bh);
> 		reiserfs_write_lock(s);
> 		lock_buffer(bh);
> 	}
> 
> You'll catch a big cause of waiting for the disk with the lock held.


Again, good catch. I will try that.

I should also check the other lock_buffer() calls, which
indeed might sleep.

Thanks!


> -chris
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01 12:18   ` Thomas Meyer
@ 2009-05-01 14:12     ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 14:12 UTC (permalink / raw)
  To: Thomas Meyer
  Cc: Ingo Molnar, LKML, Jeff Mahoney, ReiserFS Development List,
	Chris Mason, Alexander Beregalov, Alessio Igor Bogani, J

On Fri, May 01, 2009 at 02:18:36PM +0200, Thomas Meyer wrote:
> Am Freitag, den 01.05.2009, 07:35 +0200 schrieb Ingo Molnar:
> > 
> > I've pulled it and have also merged -rc4 into the kill-the-BKL tree, 
> > which can picked up from here:
> > 
> >    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git core/kill-the-BKL
> > 
> > So for comparative benchmarking, vanilla v2.6.30-rc4 (which has the 
> > BKL) can be compared against latest kill-the-BKL
> 
> Hi,
> 
> Kernel gets soft-locked-up in somewhere in initrd:
> 
> See
> http://m3y3r.de/bilder/img_0524.jpg
> http://m3y3r.de/bilder/img_0525.jpg
> http://m3y3r.de/bilder/img_0526.jpg
> http://m3y3r.de/bilder/img_0527.jpg
> http://m3y3r.de/bilder/img_0528.jpg
> http://m3y3r.de/bilder/img_0529.jpg


Looks like someone owns the kernel lock and is sleeping with the
assumption that the lock gets released across the sleep, as the old bkl did.

 
> config is:
> http://m3y3r.de/bilder/config-2.6.30-rc4-ktb


Thanks, I will try it out once I have some time.


> greets
> thomas
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01 14:01         ` Frederic Weisbecker
@ 2009-05-01 14:14           ` Chris Mason
  2009-05-02  1:19             ` Frederic Weisbecker
  0 siblings, 1 reply; 39+ messages in thread
From: Chris Mason @ 2009-05-01 14:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andi Kleen, Ingo Molnar, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

On Fri, 2009-05-01 at 16:01 +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 09:44:16AM -0400, Chris Mason wrote:
> > On Fri, 2009-05-01 at 15:28 +0200, Frederic Weisbecker wrote:
> > > On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> > > > Frederic Weisbecker <fweisbec@gmail.com> writes:
> > > > >
> > > > > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > > > > index 6587b4e..397d281 100644
> > > > > --- a/include/linux/reiserfs_fs.h
> > > > > +++ b/include/linux/reiserfs_fs.h
> > > > > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> > > > >  #define get_generation(s) atomic_read (&fs_generation(s))
> > > > >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> > > > >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > > > > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > > > > +#define fs_changed(gen,s)		\
> > > > > +({					\
> > > > > +	reiserfs_write_unlock(s);	\
> > > > > +	cond_resched();			\
> > > > > +	reiserfs_write_lock(s);		\
> > > > 
> > > > Did you try writing that 
> > > > 
> > > >     if (need_resched()) {               \
> > > > 	reiserfs_write_unlock(s);	\
> > > > 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> > > > 	reiserfs_write_lock(s);		\
> > > >     }				
> > > > 
> > > > ? That might give better performance under load because users will be better
> > > > batched and you don't release the lock unnecessarily in the unloaded case.
> > > 
> > > 
> > > 
> > > Good catch!
> > > And I guess this pattern matches most of the cond_resched()
> > > all over the code (the only condition is that we must already hold
> > > the write lock).
> > > 
> > > I will merge your idea and Ingo's one, write a
> > > reiserfs_cond_resched() to have a helper which
> > > factorizes this pattern.
> > 
> > The pattern you'll find goes like this:
> > 
> > lock_kernel()
> > do some work
> > do something that might schedule
> > run fs_changed(), fixup as required.
> > 
> > In your setup it is translating to:
> > 
> > reiserfs_write_lock(s)
> > do some work
> > reiserfs_write_unlock(s)
> > 
> > do something that might schedule
> > 
> > reiserfs_write_lock(s)
> > if (need_resched()) {
> >     reiserfs_write_unlock(s)
> >     cond_resched()
> >     reiserfs_write_lock(s)
> > }
> > 
> > if (__fs_changed()) fixup as required
> > 
> > You'll also find that item_moved is similar to __fs_changed() but more
> > fine grained.
> > 
> > One easy optimization is to make an fs_changed_relock()
> > 
> > static inline int fs_changed_relock(gen, s) {
> > 	cond_resched();
> > 	reiserfs_write_lock(s)
> > 	return __fs_changed(gen, s)
> > }
> 
> 
> 
> Nice idea!
> Does it means I can also replace the item_moved() calls to __fs_changed()?
> 

Not quite, it looks like a common convention is also

if (fs_changed && item_moved) { fixup }

item_moved is the expensive check based on the contents of a btree
block, while fs_changed is a simple generation number.

> They seem to not work the same way, I guess I should provide two different
> helpers, depending on the check.
> 
> 
> > 
> > Another cause of scheduling is going to be reiserfs_prepare_for_journal.
> > This function gets called before we modify a metadata buffer and it
> > waits for IO to finish.
> > 
> > Not sure if your patch series already found it, but if you change this:
> > 
> > int reiserfs_prepare_for_journal(struct super_block *sb,
> >                                  struct buffer_head *bh, int wait)
> > {
> >         PROC_INFO_INC(sb, journal.prepare);
> > 
> >         if (!trylock_buffer(bh)) {
> >                 if (!wait)
> >                         return 0;
> >                 lock_buffer(bh);
> >         }
> > 
> > Into:
> > 
> > 	if (!trylock_buffer(bh)) {
> > 		if (!wait)
> > 			return 0;
> > 		reiserfs_write_unlock(s);
> > 		wait_on_buffer(bh);
> > 		reiserfs_write_lock(s);
> > 		lock_buffer(bh);
> > 	}
> > 
> > You'll catch a big cause of waiting for the disk with the lock held.
> 
> 
> Again, good catch. I will try that.
> 
> I should also check the different other lock_buffer() which
> indeed might sleep.

ftrace can help some.

cd /sys/kernel/debug/tracing
echo function > current_tracer
echo func_stack_trace > trace_options
echo schedule > set_ftrace_filter
echo 1 > tracing_on

trace-histogram < trace_pipe > /tmp/trace_output

(in another window run the reiserfs workload)

When done hit ctrl-c on the trace-histogram window

You'll get the most common causes of schedule during the run, grouped by
stack trace.  Just look for the ones that are done with the lock held.

trace-histogram is here:

http://oss.oracle.com/~mason/trace-histogram

-chris



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (6 preceding siblings ...)
  2009-05-01  5:35 ` [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Ingo Molnar
@ 2009-05-01 19:59 ` Linus Torvalds
  2009-05-01 20:33   ` Ingo Molnar
  2009-05-01 21:32   ` Ingo Molnar
  2009-05-02  1:39 ` Frederic Weisbecker
  8 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2009-05-01 19:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro


On Fri, 1 May 2009, Frederic Weisbecker wrote:
> 
> This reiserfs patchset applies against latest tip:core/kill-the-BKL
> It adds various explicit write lock releases on specific sleeping sections.

Btw, is there any reason why it cannot just be re-based on top of standard 
-rc4?

I'd love to pull a "reiserfs: remove bkl" branch when the next merge 
window opens, but there's no way I'll pull the kill-bkl thing with all the 
odd random tty stuff etc that is totally unrelated.

And as far as I can tell, all your reiserfs patches are totally unrelated 
to all other changes, and it would be _much_ nicer to just merge the 
reiserfs ones.

So there's
 - the filesystem mount bkl pushdown

 - the reiserfs series

and quite frankly, I'll happily take the straightforward BKL pushdown, but 
I do _not_ want to see this mix-up with all the tty stuff, or even the NFS 
and RPC changes. 

In fact, if it makes it easier for people to merge, I can take the pure 
VFS push-down that was already Ack'ed by Al. But the rest of the stuff 
really isn't appropriate. 

For example, every single

	remove the BKL: remove "BKL auto-drop" assumption from xyz

patch in that series is pure and utter crap. I'm not going to take things 
like that. But it looks like your reiserfs patches really are totally 
independent of everything else, and should be merged as such.

		Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01 19:59 ` Linus Torvalds
@ 2009-05-01 20:33   ` Ingo Molnar
  2009-05-01 20:36     ` Ingo Molnar
  2009-05-01 21:32   ` Ingo Molnar
  1 sibling, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 1 May 2009, Frederic Weisbecker wrote:
> > 
> > This reiserfs patchset applies against latest 
> > tip:core/kill-the-BKL It adds various explicit write lock 
> > releases on specific sleeping sections.
> 
> Btw, is there any reason why it cannot just be re-based on top of 
> standard -rc4?
> 
> I'd love to pull a "reiserfs: remove bkl" branch when the next 
> merge window opens, but there's no way I'll pull the kill-bkl 
> thing with all the odd random tty stuff etc that is totally 
> unrelated.
> 
> And as far as I can tell, all your reiserfs patches are totally 
> unrelated to all other changes, and it would be _much_ nicer to 
> just merge the reiserfs ones.

Sure, that's sensible. Perhaps the patches could show up in the VFS 
tree once they are stable enough and are acceptable to Al. It will 
take quite a long time for the complete BKL elimination to be 
finished in our tree.

> So there's
>  - the filesystem mount bkl pushdown
> 
>  - the reiserfs series
> 
> and quite frankly, I'll happily take the straightforward BKL 
> pushdown, but I do _not_ want to see this mix-up with all the tty 
> stuff, or even the NFS and RPC changes.
> 
> In fact, if it makes it easier for people to merge, I can take the 
> pure VFS push-down that was already Ack'ed by Al. But the rest of 
> the stuff really isn't appropriate.
> 
> For example, every single
> 
> 	remove the BKL: remove "BKL auto-drop" assumption from xyz
> 
> patch in that series is pure and utter crap. I'm not going to take 
> things like that. But it looks like your reiserfs patches really 
> are totally independent of everything else, and should be merged 
> as such.

Yeah, we don't need that upstream.

It would obviously be a basic act of honesty and fairness to admit 
that that "crazy thing" was the thing that spurred and enabled the 
"good" patches to begin with.

We can whitewash that piece of embarrassing history out of existence 
of course, but i think it is axiomatic that if the only demonstrated 
way to achieve a good end result is over a messy middle step, that 
messy middle step shouldn't really be declared non-existent - even if 
we can.

The BKL is clearly removed at a faster rate with such debugging 
measures in place. With such measures the BKL _really_ hurts, and 
very visibly so - and that results in active removal.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01 20:33   ` Ingo Molnar
@ 2009-05-01 20:36     ` Ingo Molnar
  2009-05-01 21:11       ` Linus Torvalds
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 20:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


* Ingo Molnar <mingo@elte.hu> wrote:

> The BKL is clearly removed at a faster rate with such debugging 
> measures in place. With such measures the BKL _really_ hurts, and 
> very visibly so - and that results in active removal.

Btw., if you can think of any way to create a cleaner debug tool 
here i'd be glad to start a new tree that adds only _that_ tool - 
and that could perhaps be dealt with independently and possibly 
pulled upstream too, if clean enough.

The problem is, i think it would still quack like a global mutex, 
would walk like a global mutex and would look like a global mutex
:-/

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01 20:36     ` Ingo Molnar
@ 2009-05-01 21:11       ` Linus Torvalds
  0 siblings, 0 replies; 39+ messages in thread
From: Linus Torvalds @ 2009-05-01 21:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


On Fri, 1 May 2009, Ingo Molnar wrote:
> * Ingo Molnar <mingo@elte.hu> wrote:
> 
> > The BKL is clearly removed at a faster reate with such debugging 
> > measures in place. With such measures the BKL _really_ hurts, and 
> > very visibly so - and that results in active removal.
> 
> Btw., if you can think of any way to create a cleaner debug tool 
> here i'd be glad to start a new tree that adds only _that_ tool - 
> and that could perhaps be dealt with independently and possibly 
> pulled upstream too, if clean enough.

Quite frankly, for something as clearly defined as as a filesystem, I 
would literally suggest starting out with a few trivial stages:

 Stage 1:
 - raw search-and-replace of [un]lock_kernel()

	sed 's/unlock_kernel()/reiserfs_unlock(sb)/g'
	sed 's/lock_kernel()/reiserfs_lock(sb)/g'

   and then obviously fix it up so that 'sb' always exists (i.e. you would 
   need to add a few

	struct super_block *sb = inode->i_sb;

   lines manually to make the end result work.

 - add that 'reiserfs_[un]lock()' pair around every single VFS entry-point 
   that currently gets called with lock_kernel(). They'd _still_ get 
   called with lock-kernel too (you're not fixing that), but now they'd 
   _also_ take the new lock.

 - make reiserfs_unlock(sb) just do [un]lock_kernel(), and use that as a 
   known good starting point. Nothing has really changed (and you in fact 
   _increased_ the nesting on the BKL, since now the BKL is gotten twice 
   in those areas that were called from the VFS with the BKL held), but 
   you now have a mindless and pretty trivial conversion where you still 
   have a working system, and nothing really changed - except you now have 
   the beginnings of a nice abstraction.

IOW, "stage 1" is all totally mindless, and never breaks anything. It's 
very much designed to be "obviously correct".
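
As a sketch, the stage-1 wrappers are nothing more than (names taken
from the sed lines above, everything else illustrative):

	/* Stage 1: still the BKL underneath, only the name changes;
	 * 'sb' is unused for now but gives later stages a per-sb hook. */
	static inline void reiserfs_lock(struct super_block *sb)
	{
		lock_kernel();
	}

	static inline void reiserfs_unlock(struct super_block *sb)
	{
		unlock_kernel();
	}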

 Stage 2:
 - switch over reiserfs_[un]lock() to a per-superblock mutex, and enable 
   lockdep, and start looking for nesting problems.

IOW, "stage 2" is when you make a _minimal_ change to now enable all the 
good lockdep infrastructure. But it's also designed to just do _one_ 
thing: look at nesting. Nothing else.

 Stage 3:
 - once you've fixed all nesting problems (well, most of them - enable a 
   swapfile on that filesystem and I bet you'll find a few more), you 
   might want to look at performance issues. In particular, you want to 
   drop the lock over at least the most _common_ blocking operations, if 
   at all possible.

And stage 3 is where it would make absolutely _tons_ of sense to just add 
some very simple infrastructure for having a per-thread "IO warning 
counter", and then simply increment/decrement that counter in 
reiserfs_[un]lock().
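
A sketch of that stage-3 debug infrastructure (every name below is
invented for illustration, this is not code from any posted tree):

	static inline void reiserfs_lock(struct super_block *sb)
	{
		mutex_lock(&REISERFS_SB(sb)->lock);	/* assumed stage-2 per-sb mutex */
		current->reiserfs_locks_held++;		/* invented per-task counter */
	}

	static inline void reiserfs_unlock(struct super_block *sb)
	{
		current->reiserfs_locks_held--;		/* invented per-task counter */
		mutex_unlock(&REISERFS_SB(sb)->lock);
	}

	/* a might_sleep()-style debug check can then warn whenever we are
	 * about to block on IO while reiserfs_locks_held > 0 */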

But exactly because we do NOT want to get warnings about BKL use in all 
the _other_ subsystems, we don't want to mix this up with the BKL counter 
that we already have. You could probably even use the tracing 
infrastructure to do this: enable tracing in reiserfs_lock(), disable it 
in reiserfs_unlock(), and look at what you catch in between.

I dunno. That's how I would go about it if I just wanted to clean up a 
specific subsystem, and didn't want to tie it together with getting 
everything else right too.

			Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01 19:59 ` Linus Torvalds
  2009-05-01 20:33   ` Ingo Molnar
@ 2009-05-01 21:32   ` Ingo Molnar
  1 sibling, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2009-05-01 21:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Frederic Weisbecker, LKML, Jeff Mahoney,
	ReiserFS Development List, Chris Mason, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 1 May 2009, Frederic Weisbecker wrote:
> > 
> > This reiserfs patchset applies against latest 
> > tip:core/kill-the-BKL It adds various explicit write lock 
> > releases on specific sleeping sections.
> 
> Btw, is there any reason why it cannot just be re-based on top of 
> standard -rc4?
> 
> I'd love to pull a "reiserfs: remove bkl" branch when the next 
> merge window opens, but there's no way I'll pull the kill-bkl 
> thing with all the odd random tty stuff etc that is totally 
> unrelated.

Btw., i can name another reason why we'd want to do reiserfs 
separately: if our testing and efforts are any proof, then reiser3 
turned out to be the hardest BKL nut to crack, by a wide margin.

All the other hacks in kill-the-BKL are really of relatively low 
complexity and really just tried to map out the problem areas. Even 
the tty ones are simple - just a few recursion assumptions.

We _suspected_ that kind of status quo before, but we never had any 
conclusive proof of that. I think we now know that for sure, and we 
have to fear the BKL no more.

After the reiser3 conversion we should just tackle all the other BKL 
users one by one, and go for a straight subsystem mutex in every 
case: it will cause little trouble, and it will be a job that can be 
finished within a reasonable time frame.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end()
  2009-05-01 13:31     ` Frederic Weisbecker
@ 2009-05-01 22:20       ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-01 22:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 03:31:48PM +0200, Frederic Weisbecker wrote:
> On Fri, May 01, 2009 at 09:09:11AM +0200, Ingo Molnar wrote:
> > 
> > * Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > When do_journal_end() copies data to the journal blocks buffers in memory,
> > > it reschedules if needed between each block copied and dirtied.
> > > 
> > > We can also release the write lock at this rescheduling stage,
> > > like did the bkl implicitly.
> > > 
> > > [ Impact: release the reiserfs write lock when it is not needed ]
> > > 
> > > Cc: Jeff Mahoney <jeffm@suse.com>
> > > Cc: Chris Mason <chris.mason@oracle.com>
> > > Cc: Alexander Beregalov <a.beregalov@gmail.com>
> > > Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
> > > ---
> > >  fs/reiserfs/journal.c |    2 ++
> > >  1 files changed, 2 insertions(+), 0 deletions(-)
> > > 
> > > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > > index 7976d7d..373d080 100644
> > > --- a/fs/reiserfs/journal.c
> > > +++ b/fs/reiserfs/journal.c
> > > @@ -4232,7 +4232,9 @@ static int do_journal_end(struct reiserfs_transaction_handle *th,
> > >  		next = cn->next;
> > >  		free_cnode(sb, cn);
> > >  		cn = next;
> > > +		reiserfs_write_unlock(sb);
> > >  		cond_resched();
> > > +		reiserfs_write_lock(sb);
> > 
> > There's 8 cond_resched()'s in the code - i'd suggest to introduce a 
> > helper for this - reiserfs_write_cond_resched(sb) or so?
> > 
> > 	Ingo
> 
> Indeed, that + the pattern suggested by Al.



s/Al/Andi

Sorry for the confusion.

 
> Thanks.
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed()
  2009-05-01 14:14           ` Chris Mason
@ 2009-05-02  1:19             ` Frederic Weisbecker
  0 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-02  1:19 UTC (permalink / raw)
  To: Chris Mason
  Cc: Andi Kleen, Ingo Molnar, LKML, Jeff Mahoney,
	ReiserFS Development List, Alexander Beregalov,
	Alessio Igor Bogani, Jonathan Corbet, Alexander Viro

On Fri, May 01, 2009 at 10:14:01AM -0400, Chris Mason wrote:
> On Fri, 2009-05-01 at 16:01 +0200, Frederic Weisbecker wrote:
> > On Fri, May 01, 2009 at 09:44:16AM -0400, Chris Mason wrote:
> > > On Fri, 2009-05-01 at 15:28 +0200, Frederic Weisbecker wrote:
> > > > On Fri, May 01, 2009 at 08:31:12AM +0200, Andi Kleen wrote:
> > > > > Frederic Weisbecker <fweisbec@gmail.com> writes:
> > > > > >
> > > > > > diff --git a/include/linux/reiserfs_fs.h b/include/linux/reiserfs_fs.h
> > > > > > index 6587b4e..397d281 100644
> > > > > > --- a/include/linux/reiserfs_fs.h
> > > > > > +++ b/include/linux/reiserfs_fs.h
> > > > > > @@ -1302,7 +1302,13 @@ static inline loff_t max_reiserfs_offset(struct inode *inode)
> > > > > >  #define get_generation(s) atomic_read (&fs_generation(s))
> > > > > >  #define FILESYSTEM_CHANGED_TB(tb)  (get_generation((tb)->tb_sb) != (tb)->fs_gen)
> > > > > >  #define __fs_changed(gen,s) (gen != get_generation (s))
> > > > > > -#define fs_changed(gen,s) ({cond_resched(); __fs_changed(gen, s);})
> > > > > > +#define fs_changed(gen,s)		\
> > > > > > +({					\
> > > > > > +	reiserfs_write_unlock(s);	\
> > > > > > +	cond_resched();			\
> > > > > > +	reiserfs_write_lock(s);		\
> > > > > 
> > > > > Did you try writing that 
> > > > > 
> > > > >     if (need_resched()) {               \
> > > > > 	reiserfs_write_unlock(s);	\
> > > > > 	cond_resched();			\  (or schedule(), but cond_resched does a loop)
> > > > > 	reiserfs_write_lock(s);		\
> > > > >     }				
> > > > > 
> > > > > ? That might give better performance under load because users will be better
> > > > > batched and you don't release the lock unnecessarily in the unloaded case.
> > > > 
> > > > 
> > > > 
> > > > Good catch!
> > > > And I guess this pattern matches most of the cond_resched()
> > > > all over the code (the only condition is that we must already hold
> > > > the write lock).
> > > > 
> > > > I will merge your idea and Ingo's one, write a
> > > > reiserfs_cond_resched() to have a helper which
> > > > factorizes this pattern.
> > > 
> > > The pattern you'll find goes like this:
> > > 
> > > lock_kernel()
> > > do some work
> > > do something that might schedule
> > > run fs_changed(), fixup as required.
> > > 
> > > In your setup it is translating to:
> > > 
> > > reiserfs_write_lock(s)
> > > do some work
> > > reiserfs_write_unlock(s)
> > > 
> > > do something that might schedule
> > > 
> > > reiserfs_write_lock(s)
> > > if (need_resched()) {
> > >     reiserfs_write_unlock(s)
> > >     cond_resched()
> > >     reiserfs_write_lock(s)
> > > }
> > > 
> > > if (__fs_changed()) fixup as required
> > > 
> > > You'll also find that item_moved is similar to __fs_changed() but more
> > > fine grained.
> > > 
> > > One easy optimization is to make an fs_changed_relock()
> > > 
> > > static inline int fs_changed_relock(gen, s) {
> > > 	cond_resched();
> > > 	reiserfs_write_lock(s)
> > > 	return __fs_changed(gen, s)
> > > }
> > 
> > 
> > 
> > Nice idea!
> > Does it means I can also replace the item_moved() calls to __fs_changed()?
> > 
> 
> Not quite, it looks like a common convention is also
> 
> if (fs_changed && item_moved) { fixup }
> 
> item_moved is the expensive check based on the contents of a btree
> block, while fs_changed is a simple generation number.


Ok.

 
> > They seem to not work the same way, I guess I should provide two different
> > helpers, depending on the check.
> > 
> > 
> > > 
> > > Another cause of scheduling is going to be reiserfs_prepare_for_journal.
> > > This function gets called before we modify a metadata buffer and it
> > > waits for IO to finish.
> > > 
> > > Not sure if your patch series already found it, but if you change this:
> > > 
> > > int reiserfs_prepare_for_journal(struct super_block *sb,
> > >                                  struct buffer_head *bh, int wait)
> > > {
> > >         PROC_INFO_INC(sb, journal.prepare);
> > > 
> > >         if (!trylock_buffer(bh)) {
> > >                 if (!wait)
> > >                         return 0;
> > >                 lock_buffer(bh);
> > >         }
> > > 
> > > Into:
> > > 
> > > 	if (!trylock_buffer(bh)) {
> > > 		if (!wait)
> > > 			return 0;
> > > 		reiserfs_write_unlock(s);
> > > 		wait_on_buffer(bh);
> > > 		reiserfs_write_lock(s);
> > > 		lock_buffer(bh);
> > > 	}
> > > 
> > > You'll catch a big cause of waiting for the disk with the lock held.
> > 
> > 
> > Again, good catch. I will try that.
> > 
> > I should also check the different other lock_buffer() which
> > indeed might sleep.
> 
> ftrace can help some.
> 
> cd /sys/kernel/debug/tracing
> echo function > current_tracer
> echo func_stack_trace > trace_options
> echo schedule > set_ftrace_filter
> echo 1 > tracing_on
> 
> trace-histogram < trace_pipe > /tmp/trace_output
> 
> (in another window run the reiserfs workload)
> 
> When done hit ctrl-c on the trace-histogram window
> 
> You'll get the most common causes of schedule during the run, grouped by
> stack trace.  Just look for the ones that are done with the lock held.
> 
> trace-histogram is here:
> 
> http://oss.oracle.com/~mason/trace-histogram
> 
> -chris


Thanks, looks like a useful script! Maybe it could be merged
into scripts/tracing? I will try it out.

Recently I used a specific trick to find the hotspots that schedule
with the lock held:

I enabled PREEMPT_NONE on a UP box.
Then, once we find the lock contended, we print the
stack trace of the lock owner: we know it is sleeping, and that
it sleeps voluntarily, because of PREEMPT_NONE and UP.

The latest reiserfs patchset is a result of this trick.
I will try yours now that I'm testing on SMP preempt :)
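
A very rough sketch of what that debug check amounts to (the per-sb
mutex and lock_owner bookkeeping below are assumed for illustration;
this is not necessarily how the real write lock is implemented):

	void reiserfs_write_lock(struct super_block *sb)
	{
		struct reiserfs_sb_info *sbi = REISERFS_SB(sb);

		/* on UP with PREEMPT_NONE, a contended lock means the owner
		 * blocked voluntarily while holding it: dump its stack */
		if (mutex_is_locked(&sbi->lock) && sbi->lock_owner)
			sched_show_task(sbi->lock_owner);

		mutex_lock(&sbi->lock);
		sbi->lock_owner = current;	/* assumed bookkeeping field */
	}

	void reiserfs_write_unlock(struct super_block *sb)
	{
		struct reiserfs_sb_info *sbi = REISERFS_SB(sb);

		sbi->lock_owner = NULL;
		mutex_unlock(&sbi->lock);
	}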


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme
  2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
                   ` (7 preceding siblings ...)
  2009-05-01 19:59 ` Linus Torvalds
@ 2009-05-02  1:39 ` Frederic Weisbecker
  8 siblings, 0 replies; 39+ messages in thread
From: Frederic Weisbecker @ 2009-05-02  1:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Jeff Mahoney, ReiserFS Development List, Chris Mason,
	Alexander Beregalov, Alessio Igor Bogani, Jonathan Corbet,
	Alexander Viro

On Fri, May 01, 2009 at 04:44:16AM +0200, Frederic Weisbecker wrote:
> Hi,
> 
> This reiserfs patchset applies against latest tip:core/kill-the-BKL
> It adds various explicit write lock releases on specific sleeping sections.
> 
> A performance test with dbench on UP with 100 processus during 100 seconds
> gives the following results:
> 
> Locking		  Throughput 
> 
> Bkl:              11.2587 MB/s
> Write lock/Mutex: 12.5713 MB/s


I just tested on SMP with two CPUs:

Write Lock/Mutex: Throughput 66.9269 MB/sec 100 procs
Bkl:              Throughput 71.9998 MB/sec 100 procs

So:

UP:  11% faster
SMP:  8% slower


I guess there is still some work to do :)


^ permalink raw reply	[flat|nested] 39+ messages in thread

Thread overview: 39+ messages
2009-05-01  2:44 [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Frederic Weisbecker
2009-05-01  2:44 ` [PATCH 1/6] kill-the-BKL/reiserfs: release write lock on fs_changed() Frederic Weisbecker
2009-05-01  6:31   ` Andi Kleen
2009-05-01 13:28     ` Frederic Weisbecker
2009-05-01 13:44       ` Chris Mason
2009-05-01 14:01         ` Frederic Weisbecker
2009-05-01 14:14           ` Chris Mason
2009-05-02  1:19             ` Frederic Weisbecker
2009-05-01  2:44 ` [PATCH 2/6] kill-the-BKL/reiserfs: release the write lock before rescheduling on do_journal_end() Frederic Weisbecker
2009-05-01  7:09   ` Ingo Molnar
2009-05-01 13:31     ` Frederic Weisbecker
2009-05-01 22:20       ` Frederic Weisbecker
2009-05-01  2:44 ` [PATCH 3/6] kill-the-BKL/reiserfs: release write lock while rescheduling on prepare_for_delete_or_cut() Frederic Weisbecker
2009-05-01  2:44 ` [PATCH 4/6] kill-the-BKL/reiserfs: release the write lock inside get_neighbors() Frederic Weisbecker
2009-05-01  5:51   ` Ingo Molnar
2009-05-01 13:25     ` Frederic Weisbecker
2009-05-01 13:29       ` Chris Mason
2009-05-01 13:31         ` Ingo Molnar
2009-05-01  2:44 ` [PATCH 5/6] kill-the-BKL/reiserfs: release the write lock inside reiserfs_read_bitmap_block() Frederic Weisbecker
2009-05-01  5:47   ` Ingo Molnar
2009-05-01 13:19     ` Frederic Weisbecker
2009-05-01 13:30       ` Chris Mason
2009-05-01 13:51         ` Frederic Weisbecker
2009-05-01  2:44 ` [PATCH 6/6] kill-the-BKL/reiserfs: release the write lock on flush_commit_list() Frederic Weisbecker
2009-05-01  5:42   ` Ingo Molnar
2009-05-01 13:13     ` Frederic Weisbecker
2009-05-01 13:23       ` Ingo Molnar
2009-05-01 13:26       ` Chris Mason
2009-05-01 13:29         ` Ingo Molnar
2009-05-01 13:54           ` Chris Mason
2009-05-01  5:35 ` [PATCH 0/6] kill-the-BKL/reiserfs3: performance improvements, faster than Bkl based scheme Ingo Molnar
2009-05-01 12:18   ` Thomas Meyer
2009-05-01 14:12     ` Frederic Weisbecker
2009-05-01 19:59 ` Linus Torvalds
2009-05-01 20:33   ` Ingo Molnar
2009-05-01 20:36     ` Ingo Molnar
2009-05-01 21:11       ` Linus Torvalds
2009-05-01 21:32   ` Ingo Molnar
2009-05-02  1:39 ` Frederic Weisbecker
