* [PATCH 0/7 v2] jbd2: Bit spinlock conversions
@ 2019-08-09 12:42 Jan Kara
  2019-08-09 12:42 ` [PATCH 1/7] jbd2: Simplify journal_unmap_buffer() Jan Kara
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

Hello,

This series is derived from Thomas' series to get rid of bit spinlocks in
buffer head code. These patches convert BH_State bit spinlock to an ordinary
spinlock inside struct journal_head and somewhat reduce the critical section
under BH_JournalHead bit spinlock so that it is fine for RT. 

Motivation from original Thomas' series:

Bit spinlocks are problematic if PREEMPT_RT is enabled. They disable
preemption, which is undesired for latency reasons and breaks when regular
spinlocks are taken within the bit_spinlock locked region because regular
spinlocks are converted to 'sleeping spinlocks' on RT.

Bit spinlocks are also not covered by lock debugging, e.g. lockdep. With
the spinlock substitution in place, they can be exposed via a new config
switch: CONFIG_DEBUG_BIT_SPINLOCKS.
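
To make the failure mode concrete, here is a minimal sketch of the pattern
that breaks (illustrative only, not taken from jbd2; the structure and
function names are made up): a regular spinlock acquired inside a
bit-spinlocked region, where the outer lock disables preemption but the
inner lock may sleep on RT.

#include <linux/bit_spinlock.h>
#include <linux/spinlock.h>

struct demo {
	unsigned long	state;		/* bit 0 (ab)used as a bit spinlock */
	spinlock_t	list_lock;	/* becomes a sleeping lock on PREEMPT_RT */
};

static void demo_update(struct demo *d)
{
	/* list_lock is assumed to have been spin_lock_init()ed elsewhere */
	bit_spin_lock(0, &d->state);	/* disables preemption, on RT too */
	spin_lock(&d->list_lock);	/* may sleep on RT -> not allowed here */
	/* ... manipulate data protected by list_lock ... */
	spin_unlock(&d->list_lock);
	bit_spin_unlock(0, &d->state);
}

With the BH_State bit spinlock replaced by a regular spinlock in struct
journal_head, both locks in such a nesting are sleeping locks on RT, so the
nesting is fine (and both become visible to lockdep).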

Ted, can you pick up these patches? Thanks!

Changes since v1:
* Fixed up compilation breakage on UP due to missing linux/spinlock.h include

								Honza


* [PATCH 1/7] jbd2: Simplify journal_unmap_buffer()
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-08-09 12:42 ` [PATCH 2/7] jbd2: Remove jbd_trylock_bh_state() Jan Kara
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

From: Thomas Gleixner <tglx@linutronix.de>

journal_unmap_buffer() first checks whether the buffer head is journaled
(i.e. has a journal head attached). If so, it takes locks and then invokes
jbd2_journal_grab_journal_head() followed by another check whether the
buffer has a journal head.

The double checking is pointless.

Replace the initial check with jbd2_journal_grab_journal_head(), which
already checks whether the buffer head actually has a journal head attached.

This also allows early access to the journal head pointer for the upcoming
conversion of the state lock to a regular spinlock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 990e7b5062e7..f9b81cfdd8be 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2196,7 +2196,8 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	 * holding the page lock. --sct
 	 */
 
-	if (!buffer_jbd(bh))
+	jh = jbd2_journal_grab_journal_head(bh);
+	if (!jh)
 		goto zap_buffer_unlocked;
 
 	/* OK, we have data buffer in journaled mode */
@@ -2204,10 +2205,6 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	jbd_lock_bh_state(bh);
 	spin_lock(&journal->j_list_lock);
 
-	jh = jbd2_journal_grab_journal_head(bh);
-	if (!jh)
-		goto zap_buffer_no_jh;
-
 	/*
 	 * We cannot remove the buffer from checkpoint lists until the
 	 * transaction adding inode to orphan list (let's call it T)
@@ -2329,7 +2326,6 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	 */
 	jh->b_modified = 0;
 	jbd2_journal_put_journal_head(jh);
-zap_buffer_no_jh:
 	spin_unlock(&journal->j_list_lock);
 	jbd_unlock_bh_state(bh);
 	write_unlock(&journal->j_state_lock);
-- 
2.16.4



* [PATCH 2/7] jbd2: Remove jbd_trylock_bh_state()
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
  2019-08-09 12:42 ` [PATCH 1/7] jbd2: Simplify journal_unmap_buffer() Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-08-09 12:42 ` [PATCH 3/7] jbd2: Move dropping of jh reference out of un/re-filing functions Jan Kara
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

From: Thomas Gleixner <tglx@linutronix.de>

No users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/jbd2.h | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index df03825ad1a1..f53b13b20e83 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -347,11 +347,6 @@ static inline void jbd_lock_bh_state(struct buffer_head *bh)
 	bit_spin_lock(BH_State, &bh->b_state);
 }
 
-static inline int jbd_trylock_bh_state(struct buffer_head *bh)
-{
-	return bit_spin_trylock(BH_State, &bh->b_state);
-}
-
 static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
 {
 	return bit_spin_is_locked(BH_State, &bh->b_state);
-- 
2.16.4



* [PATCH 3/7] jbd2: Move dropping of jh reference out of un/re-filing functions
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
  2019-08-09 12:42 ` [PATCH 1/7] jbd2: Simplify journal_unmap_buffer() Jan Kara
  2019-08-09 12:42 ` [PATCH 2/7] jbd2: Remove jbd_trylock_bh_state() Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-08-09 12:42 ` [PATCH 4/7] jbd2: Drop unnecessary branch from jbd2_journal_forget() Jan Kara
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

__jbd2_journal_unfile_buffer() and __jbd2_journal_refile_buffer() drop the
transaction's jh reference when they remove the jh from a transaction.
However, this will become inconvenient once we move the state lock into the
journal_head itself, as we still need to unlock it afterwards and we would
have to grab a jh reference just for that. Move the dropping of the jh
reference out of these functions and into the few callers.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c      |  5 ++++-
 fs/jbd2/transaction.c | 23 +++++++++++++++--------
 include/linux/jbd2.h  |  2 +-
 3 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 132fb92098c7..a0a191f3df77 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -918,6 +918,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		transaction_t *cp_transaction;
 		struct buffer_head *bh;
 		int try_to_free = 0;
+		bool drop_ref;
 
 		jh = commit_transaction->t_forget;
 		spin_unlock(&journal->j_list_lock);
@@ -1022,8 +1023,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				try_to_free = 1;
 		}
 		JBUFFER_TRACE(jh, "refile or unfile buffer");
-		__jbd2_journal_refile_buffer(jh);
+		drop_ref = __jbd2_journal_refile_buffer(jh);
 		jbd_unlock_bh_state(bh);
+		if (drop_ref)
+			jbd2_journal_put_journal_head(jh);
 		if (try_to_free)
 			release_buffer_page(bh);	/* Drops bh reference */
 		else
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f9b81cfdd8be..c4fe050e78c6 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1595,6 +1595,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 			__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
 		} else {
 			__jbd2_journal_unfile_buffer(jh);
+			jbd2_journal_put_journal_head(jh);
 			if (!buffer_jbd(bh)) {
 				spin_unlock(&journal->j_list_lock);
 				goto not_jbd;
@@ -1968,17 +1969,15 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 }
 
 /*
- * Remove buffer from all transactions.
+ * Remove buffer from all transactions. The caller is responsible for dropping
+ * the jh reference that belonged to the transaction.
  *
  * Called with bh_state lock and j_list_lock
- *
- * jh and bh may be already freed when this function returns.
  */
 static void __jbd2_journal_unfile_buffer(struct journal_head *jh)
 {
 	__jbd2_journal_temp_unlink_buffer(jh);
 	jh->b_transaction = NULL;
-	jbd2_journal_put_journal_head(jh);
 }
 
 void jbd2_journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
@@ -1992,6 +1991,7 @@ void jbd2_journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
 	__jbd2_journal_unfile_buffer(jh);
 	spin_unlock(&journal->j_list_lock);
 	jbd_unlock_bh_state(bh);
+	jbd2_journal_put_journal_head(jh);
 	__brelse(bh);
 }
 
@@ -2130,6 +2130,7 @@ static int __dispose_buffer(struct journal_head *jh, transaction_t *transaction)
 	} else {
 		JBUFFER_TRACE(jh, "on running transaction");
 		__jbd2_journal_unfile_buffer(jh);
+		jbd2_journal_put_journal_head(jh);
 	}
 	return may_free;
 }
@@ -2493,9 +2494,11 @@ void jbd2_journal_file_buffer(struct journal_head *jh,
  * Called under j_list_lock
  * Called under jbd_lock_bh_state(jh2bh(jh))
  *
- * jh and bh may be already free when this function returns
+ * When this function returns true, there's no next transaction to refile to
+ * and the caller has to drop jh reference through
+ * jbd2_journal_put_journal_head().
  */
-void __jbd2_journal_refile_buffer(struct journal_head *jh)
+bool __jbd2_journal_refile_buffer(struct journal_head *jh)
 {
 	int was_dirty, jlist;
 	struct buffer_head *bh = jh2bh(jh);
@@ -2507,7 +2510,7 @@ void __jbd2_journal_refile_buffer(struct journal_head *jh)
 	/* If the buffer is now unused, just drop it. */
 	if (jh->b_next_transaction == NULL) {
 		__jbd2_journal_unfile_buffer(jh);
-		return;
+		return true;
 	}
 
 	/*
@@ -2535,6 +2538,7 @@ void __jbd2_journal_refile_buffer(struct journal_head *jh)
 
 	if (was_dirty)
 		set_buffer_jbddirty(bh);
+	return false;
 }
 
 /*
@@ -2546,15 +2550,18 @@ void __jbd2_journal_refile_buffer(struct journal_head *jh)
 void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 {
 	struct buffer_head *bh = jh2bh(jh);
+	bool drop;
 
 	/* Get reference so that buffer cannot be freed before we unlock it */
 	get_bh(bh);
 	jbd_lock_bh_state(bh);
 	spin_lock(&journal->j_list_lock);
-	__jbd2_journal_refile_buffer(jh);
+	drop = __jbd2_journal_refile_buffer(jh);
 	jbd_unlock_bh_state(bh);
 	spin_unlock(&journal->j_list_lock);
 	__brelse(bh);
+	if (drop)
+		jbd2_journal_put_journal_head(jh);
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f53b13b20e83..6aa66c42599d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1252,7 +1252,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(csum3,		CSUM_V3)
 
 /* Filing buffers */
 extern void jbd2_journal_unfile_buffer(journal_t *, struct journal_head *);
-extern void __jbd2_journal_refile_buffer(struct journal_head *);
+extern bool __jbd2_journal_refile_buffer(struct journal_head *);
 extern void jbd2_journal_refile_buffer(journal_t *, struct journal_head *);
 extern void __jbd2_journal_file_buffer(struct journal_head *, transaction_t *, int);
 extern void __journal_free_buffer(struct journal_head *bh);
-- 
2.16.4



* [PATCH 4/7] jbd2: Drop unnecessary branch from jbd2_journal_forget()
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (2 preceding siblings ...)
  2019-08-09 12:42 ` [PATCH 3/7] jbd2: Move dropping of jh reference out of un/re-filing functions Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-08-09 12:42 ` [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily Jan Kara
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

We have cleared both the dirty & jbddirty bits from the bh, so there is no
difference between bforget() and brelse(). Thus there is no point in jumping
to the 'not_jbd' branch.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c4fe050e78c6..9ccef3d6e817 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1596,10 +1596,6 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		} else {
 			__jbd2_journal_unfile_buffer(jh);
 			jbd2_journal_put_journal_head(jh);
-			if (!buffer_jbd(bh)) {
-				spin_unlock(&journal->j_list_lock);
-				goto not_jbd;
-			}
 		}
 		spin_unlock(&journal->j_list_lock);
 	} else if (jh->b_transaction) {
-- 
2.16.4



* [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (3 preceding siblings ...)
  2019-08-09 12:42 ` [PATCH 4/7] jbd2: Drop unnecessary branch from jbd2_journal_forget() Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-10-28 15:28   ` Theodore Y. Ts'o
  2019-08-09 12:42 ` [PATCH 6/7] jbd2: Make state lock a spinlock Jan Kara
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara

jbd2_journal_forget() jumps to the 'not_jbd' branch, which calls __bforget(),
even in cases where the buffer is clean, which is pointless. In the case of a
failed assertion it can even be argued that it is safer not to touch the
buffer's dirty bits at all. Logically it also makes more sense to just jump
to 'drop', and that will simplify the logic once we switch the bh state lock
to a spinlock.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 9ccef3d6e817..d752d7f6929e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1547,7 +1547,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 	if (!J_EXPECT_JH(jh, !jh->b_committed_data,
 			 "inconsistent data on disk")) {
 		err = -EIO;
-		goto not_jbd;
+		goto drop;
 	}
 
 	/* keep track of whether or not this transaction modified us */
@@ -1637,7 +1637,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		if (!jh->b_cp_transaction) {
 			JBUFFER_TRACE(jh, "belongs to none transaction");
 			spin_unlock(&journal->j_list_lock);
-			goto not_jbd;
+			goto drop;
 		}
 
 		/*
@@ -1647,7 +1647,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		if (!buffer_dirty(bh)) {
 			__jbd2_journal_remove_checkpoint(jh);
 			spin_unlock(&journal->j_list_lock);
-			goto not_jbd;
+			goto drop;
 		}
 
 		/*
@@ -1660,10 +1660,9 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
 		spin_unlock(&journal->j_list_lock);
 	}
-
+drop:
 	jbd_unlock_bh_state(bh);
 	__brelse(bh);
-drop:
 	if (drop_reserve) {
 		/* no need to reserve log space for this block -bzzz */
 		handle->h_buffer_credits++;
-- 
2.16.4



* [PATCH 6/7] jbd2: Make state lock a spinlock
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (4 preceding siblings ...)
  2019-08-09 12:42 ` [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-08-09 12:42 ` [PATCH 7/7] jbd2: Free journal head outside of locked region Jan Kara
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara,
	Mark Fasheh, Joseph Qi, Joel Becker, Jan Kara

From: Thomas Gleixner <tglx@linutronix.de>

Bit spinlocks are problematic on PREEMPT_RT if functions which might sleep
on RT, e.g. spin_lock() or alloc/free(), are invoked inside the lock-held
region, because bit spinlocks disable preemption even on RT.

A first attempt was to replace the state lock with a spinlock placed in
struct buffer_head and to make the locking conditional on PREEMPT_RT and
DEBUG_BIT_SPINLOCKS.

Jan pointed out that there is a 4 byte hole in struct journal_head where a
regular spinlock fits, and that he would not object to converting the state
lock to a spinlock unconditionally.

Aside from solving the RT problem, this also gains lockdep coverage for the
journal head state lock (bit spinlocks are not covered by lockdep as it is
hard to fit a lockdep map into a single bit).
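
As an illustrative aside (not part of the patch; the structure and names
below are made up): a regular spinlock gets a lockdep class from
spin_lock_init(), which is what makes assertions like the one below
possible, whereas a bit spinlock has nowhere to store a lockdep map.

#include <linux/lockdep.h>
#include <linux/spinlock.h>

struct demo_head {
	spinlock_t state_lock;			/* stands in for jh->b_state_lock */
};

static void demo_head_init(struct demo_head *dh)
{
	spin_lock_init(&dh->state_lock);	/* registers a lockdep class */
}

static void demo_head_update(struct demo_head *dh)
{
	lockdep_assert_held(&dh->state_lock);	/* replaces ad-hoc "is locked" checks */
	/* ... modify state protected by state_lock ... */
}

The conversion below uses this pattern for jh->b_state_lock:
spin_lock_init() in journal_alloc_journal_head() and lockdep_assert_held()
in place of the old jbd_is_locked_bh_state() assertions.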

The trivial change would have been to convert the jbd_*lock_bh_state()
inlines, but that comes with the downside that these functions take a buffer
head pointer which first needs to be converted to a journal head pointer,
adding another level of indirection.

As almost all functions which use this lock have a journal head pointer
readily available, it makes more sense to remove the lock helper inlines
and write out spin_*lock() at all call sites.

Fix up all locking comments as well.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jan Kara <jack@suse.com>
Cc: linux-ext4@vger.kernel.org
---
 fs/jbd2/commit.c             |   8 ++--
 fs/jbd2/journal.c            |  10 +++--
 fs/jbd2/transaction.c        | 100 ++++++++++++++++++++-----------------------
 fs/ocfs2/suballoc.c          |  19 ++++----
 include/linux/jbd2.h         |  20 +--------
 include/linux/journal-head.h |  21 ++++++---
 6 files changed, 84 insertions(+), 94 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index a0a191f3df77..8e1ff875de43 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -482,10 +482,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		if (jh->b_committed_data) {
 			struct buffer_head *bh = jh2bh(jh);
 
-			jbd_lock_bh_state(bh);
+			spin_lock(&jh->b_state_lock);
 			jbd2_free(jh->b_committed_data, bh->b_size);
 			jh->b_committed_data = NULL;
-			jbd_unlock_bh_state(bh);
+			spin_unlock(&jh->b_state_lock);
 		}
 		jbd2_journal_refile_buffer(journal, jh);
 	}
@@ -928,7 +928,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		 * done with it.
 		 */
 		get_bh(bh);
-		jbd_lock_bh_state(bh);
+		spin_lock(&jh->b_state_lock);
 		J_ASSERT_JH(jh,	jh->b_transaction == commit_transaction);
 
 		/*
@@ -1024,7 +1024,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		}
 		JBUFFER_TRACE(jh, "refile or unfile buffer");
 		drop_ref = __jbd2_journal_refile_buffer(jh);
-		jbd_unlock_bh_state(bh);
+		spin_unlock(&jh->b_state_lock);
 		if (drop_ref)
 			jbd2_journal_put_journal_head(jh);
 		if (try_to_free)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 953990eb70a9..eee350fef339 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -365,7 +365,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
 
-	jbd_lock_bh_state(bh_in);
+	spin_lock(&jh_in->b_state_lock);
 repeat:
 	/*
 	 * If a new transaction has already done a buffer copy-out, then
@@ -407,13 +407,13 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	if (need_copy_out && !done_copy_out) {
 		char *tmp;
 
-		jbd_unlock_bh_state(bh_in);
+		spin_unlock(&jh_in->b_state_lock);
 		tmp = jbd2_alloc(bh_in->b_size, GFP_NOFS);
 		if (!tmp) {
 			brelse(new_bh);
 			return -ENOMEM;
 		}
-		jbd_lock_bh_state(bh_in);
+		spin_lock(&jh_in->b_state_lock);
 		if (jh_in->b_frozen_data) {
 			jbd2_free(tmp, bh_in->b_size);
 			goto repeat;
@@ -466,7 +466,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	__jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
 	spin_unlock(&journal->j_list_lock);
 	set_buffer_shadow(bh_in);
-	jbd_unlock_bh_state(bh_in);
+	spin_unlock(&jh_in->b_state_lock);
 
 	return do_escape | (done_copy_out << 1);
 }
@@ -2412,6 +2412,8 @@ static struct journal_head *journal_alloc_journal_head(void)
 		ret = kmem_cache_zalloc(jbd2_journal_head_cache,
 				GFP_NOFS | __GFP_NOFAIL);
 	}
+	if (ret)
+		spin_lock_init(&ret->b_state_lock);
 	return ret;
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index d752d7f6929e..1db5be820816 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -876,7 +876,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 
  	start_lock = jiffies;
 	lock_buffer(bh);
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 
 	/* If it takes too long to lock the buffer, trace it */
 	time_lock = jbd2_time_diff(start_lock, jiffies);
@@ -926,7 +926,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 
 	error = -EROFS;
 	if (is_handle_aborted(handle)) {
-		jbd_unlock_bh_state(bh);
+		spin_unlock(&jh->b_state_lock);
 		goto out;
 	}
 	error = 0;
@@ -990,7 +990,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	 */
 	if (buffer_shadow(bh)) {
 		JBUFFER_TRACE(jh, "on shadow: sleep");
-		jbd_unlock_bh_state(bh);
+		spin_unlock(&jh->b_state_lock);
 		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
 		goto repeat;
 	}
@@ -1011,7 +1011,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 		JBUFFER_TRACE(jh, "generate frozen data");
 		if (!frozen_buffer) {
 			JBUFFER_TRACE(jh, "allocate memory for buffer");
-			jbd_unlock_bh_state(bh);
+			spin_unlock(&jh->b_state_lock);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
 						   GFP_NOFS | __GFP_NOFAIL);
 			goto repeat;
@@ -1030,7 +1030,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	jh->b_next_transaction = transaction;
 
 done:
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 
 	/*
 	 * If we are about to journal a buffer, then any revoke pending on it is
@@ -1169,7 +1169,7 @@ int jbd2_journal_get_create_access(handle_t *handle, struct buffer_head *bh)
 	 * that case: the transaction must have deleted the buffer for it to be
 	 * reused here.
 	 */
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 	J_ASSERT_JH(jh, (jh->b_transaction == transaction ||
 		jh->b_transaction == NULL ||
 		(jh->b_transaction == journal->j_committing_transaction &&
@@ -1204,7 +1204,7 @@ int jbd2_journal_get_create_access(handle_t *handle, struct buffer_head *bh)
 		jh->b_next_transaction = transaction;
 		spin_unlock(&journal->j_list_lock);
 	}
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 
 	/*
 	 * akpm: I added this.  ext3_alloc_branch can pick up new indirect
@@ -1272,13 +1272,13 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 		committed_data = jbd2_alloc(jh2bh(jh)->b_size,
 					    GFP_NOFS|__GFP_NOFAIL);
 
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 	if (!jh->b_committed_data) {
 		/* Copy out the current buffer contents into the
 		 * preserved, committed copy. */
 		JBUFFER_TRACE(jh, "generate b_committed data");
 		if (!committed_data) {
-			jbd_unlock_bh_state(bh);
+			spin_unlock(&jh->b_state_lock);
 			goto repeat;
 		}
 
@@ -1286,7 +1286,7 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 		committed_data = NULL;
 		memcpy(jh->b_committed_data, bh->b_data, bh->b_size);
 	}
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 out:
 	jbd2_journal_put_journal_head(jh);
 	if (unlikely(committed_data))
@@ -1387,16 +1387,16 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	 */
 	if (jh->b_transaction != transaction &&
 	    jh->b_next_transaction != transaction) {
-		jbd_lock_bh_state(bh);
+		spin_lock(&jh->b_state_lock);
 		J_ASSERT_JH(jh, jh->b_transaction == transaction ||
 				jh->b_next_transaction == transaction);
-		jbd_unlock_bh_state(bh);
+		spin_unlock(&jh->b_state_lock);
 	}
 	if (jh->b_modified == 1) {
 		/* If it's in our transaction it must be in BJ_Metadata list. */
 		if (jh->b_transaction == transaction &&
 		    jh->b_jlist != BJ_Metadata) {
-			jbd_lock_bh_state(bh);
+			spin_lock(&jh->b_state_lock);
 			if (jh->b_transaction == transaction &&
 			    jh->b_jlist != BJ_Metadata)
 				pr_err("JBD2: assertion failure: h_type=%u "
@@ -1406,13 +1406,13 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 				       jh->b_jlist);
 			J_ASSERT_JH(jh, jh->b_transaction != transaction ||
 					jh->b_jlist == BJ_Metadata);
-			jbd_unlock_bh_state(bh);
+			spin_unlock(&jh->b_state_lock);
 		}
 		goto out;
 	}
 
 	journal = transaction->t_journal;
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 
 	if (jh->b_modified == 0) {
 		/*
@@ -1498,7 +1498,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	__jbd2_journal_file_buffer(jh, transaction, BJ_Metadata);
 	spin_unlock(&journal->j_list_lock);
 out_unlock_bh:
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 out:
 	JBUFFER_TRACE(jh, "exit");
 	return ret;
@@ -1536,11 +1536,13 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 
 	BUFFER_TRACE(bh, "entry");
 
-	jbd_lock_bh_state(bh);
+	jh = jbd2_journal_grab_journal_head(bh);
+	if (!jh) {
+		__bforget(bh);
+		return 0;
+	}
 
-	if (!buffer_jbd(bh))
-		goto not_jbd;
-	jh = bh2jh(bh);
+	spin_lock(&jh->b_state_lock);
 
 	/* Critical error: attempting to delete a bitmap buffer, maybe?
 	 * Don't do any jbd operations, and return an error. */
@@ -1661,18 +1663,14 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		spin_unlock(&journal->j_list_lock);
 	}
 drop:
-	jbd_unlock_bh_state(bh);
 	__brelse(bh);
+	spin_unlock(&jh->b_state_lock);
+	jbd2_journal_put_journal_head(jh);
 	if (drop_reserve) {
 		/* no need to reserve log space for this block -bzzz */
 		handle->h_buffer_credits++;
 	}
 	return err;
-
-not_jbd:
-	jbd_unlock_bh_state(bh);
-	__bforget(bh);
-	goto drop;
 }
 
 /**
@@ -1871,7 +1869,7 @@ int jbd2_journal_stop(handle_t *handle)
  *
  * j_list_lock is held.
  *
- * jbd_lock_bh_state(jh2bh(jh)) is held.
+ * jh->b_state_lock is held.
  */
 
 static inline void
@@ -1895,7 +1893,7 @@ __blist_add_buffer(struct journal_head **list, struct journal_head *jh)
  *
  * Called with j_list_lock held, and the journal may not be locked.
  *
- * jbd_lock_bh_state(jh2bh(jh)) is held.
+ * jh->b_state_lock is held.
  */
 
 static inline void
@@ -1927,7 +1925,7 @@ static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
 	transaction_t *transaction;
 	struct buffer_head *bh = jh2bh(jh);
 
-	J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
+	lockdep_assert_held(&jh->b_state_lock);
 	transaction = jh->b_transaction;
 	if (transaction)
 		assert_spin_locked(&transaction->t_journal->j_list_lock);
@@ -1981,11 +1979,11 @@ void jbd2_journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
 
 	/* Get reference so that buffer cannot be freed before we unlock it */
 	get_bh(bh);
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 	spin_lock(&journal->j_list_lock);
 	__jbd2_journal_unfile_buffer(jh);
 	spin_unlock(&journal->j_list_lock);
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 	jbd2_journal_put_journal_head(jh);
 	__brelse(bh);
 }
@@ -1993,7 +1991,7 @@ void jbd2_journal_unfile_buffer(journal_t *journal, struct journal_head *jh)
 /*
  * Called from jbd2_journal_try_to_free_buffers().
  *
- * Called under jbd_lock_bh_state(bh)
+ * Called under jh->b_state_lock
  */
 static void
 __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
@@ -2080,10 +2078,10 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal,
 		if (!jh)
 			continue;
 
-		jbd_lock_bh_state(bh);
+		spin_lock(&jh->b_state_lock);
 		__journal_try_to_free_buffer(journal, bh);
+		spin_unlock(&jh->b_state_lock);
 		jbd2_journal_put_journal_head(jh);
-		jbd_unlock_bh_state(bh);
 		if (buffer_jbd(bh))
 			goto busy;
 	} while ((bh = bh->b_this_page) != head);
@@ -2104,7 +2102,7 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal,
  *
  * Called under j_list_lock.
  *
- * Called under jbd_lock_bh_state(bh).
+ * Called under jh->b_state_lock.
  */
 static int __dispose_buffer(struct journal_head *jh, transaction_t *transaction)
 {
@@ -2198,7 +2196,7 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 
 	/* OK, we have data buffer in journaled mode */
 	write_lock(&journal->j_state_lock);
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 	spin_lock(&journal->j_list_lock);
 
 	/*
@@ -2279,10 +2277,10 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 		 * for commit and try again.
 		 */
 		if (partial_page) {
-			jbd2_journal_put_journal_head(jh);
 			spin_unlock(&journal->j_list_lock);
-			jbd_unlock_bh_state(bh);
+			spin_unlock(&jh->b_state_lock);
 			write_unlock(&journal->j_state_lock);
+			jbd2_journal_put_journal_head(jh);
 			return -EBUSY;
 		}
 		/*
@@ -2294,10 +2292,10 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 		set_buffer_freed(bh);
 		if (journal->j_running_transaction && buffer_jbddirty(bh))
 			jh->b_next_transaction = journal->j_running_transaction;
-		jbd2_journal_put_journal_head(jh);
 		spin_unlock(&journal->j_list_lock);
-		jbd_unlock_bh_state(bh);
+		spin_unlock(&jh->b_state_lock);
 		write_unlock(&journal->j_state_lock);
+		jbd2_journal_put_journal_head(jh);
 		return 0;
 	} else {
 		/* Good, the buffer belongs to the running transaction.
@@ -2321,10 +2319,10 @@ static int journal_unmap_buffer(journal_t *journal, struct buffer_head *bh,
 	 * here.
 	 */
 	jh->b_modified = 0;
-	jbd2_journal_put_journal_head(jh);
 	spin_unlock(&journal->j_list_lock);
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 	write_unlock(&journal->j_state_lock);
+	jbd2_journal_put_journal_head(jh);
 zap_buffer_unlocked:
 	clear_buffer_dirty(bh);
 	J_ASSERT_BH(bh, !buffer_jbddirty(bh));
@@ -2411,7 +2409,7 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
 	int was_dirty = 0;
 	struct buffer_head *bh = jh2bh(jh);
 
-	J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
+	lockdep_assert_held(&jh->b_state_lock);
 	assert_spin_locked(&transaction->t_journal->j_list_lock);
 
 	J_ASSERT_JH(jh, jh->b_jlist < BJ_Types);
@@ -2473,11 +2471,11 @@ void __jbd2_journal_file_buffer(struct journal_head *jh,
 void jbd2_journal_file_buffer(struct journal_head *jh,
 				transaction_t *transaction, int jlist)
 {
-	jbd_lock_bh_state(jh2bh(jh));
+	spin_lock(&jh->b_state_lock);
 	spin_lock(&transaction->t_journal->j_list_lock);
 	__jbd2_journal_file_buffer(jh, transaction, jlist);
 	spin_unlock(&transaction->t_journal->j_list_lock);
-	jbd_unlock_bh_state(jh2bh(jh));
+	spin_unlock(&jh->b_state_lock);
 }
 
 /*
@@ -2487,7 +2485,7 @@ void jbd2_journal_file_buffer(struct journal_head *jh,
  * buffer on that transaction's metadata list.
  *
  * Called under j_list_lock
- * Called under jbd_lock_bh_state(jh2bh(jh))
+ * Called under jh->b_state_lock
  *
  * When this function returns true, there's no next transaction to refile to
  * and the caller has to drop jh reference through
@@ -2498,7 +2496,7 @@ bool __jbd2_journal_refile_buffer(struct journal_head *jh)
 	int was_dirty, jlist;
 	struct buffer_head *bh = jh2bh(jh);
 
-	J_ASSERT_JH(jh, jbd_is_locked_bh_state(bh));
+	lockdep_assert_held(&jh->b_state_lock);
 	if (jh->b_transaction)
 		assert_spin_locked(&jh->b_transaction->t_journal->j_list_lock);
 
@@ -2544,17 +2542,13 @@ bool __jbd2_journal_refile_buffer(struct journal_head *jh)
  */
 void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
 {
-	struct buffer_head *bh = jh2bh(jh);
 	bool drop;
 
-	/* Get reference so that buffer cannot be freed before we unlock it */
-	get_bh(bh);
-	jbd_lock_bh_state(bh);
+	spin_lock(&jh->b_state_lock);
 	spin_lock(&journal->j_list_lock);
 	drop = __jbd2_journal_refile_buffer(jh);
-	jbd_unlock_bh_state(bh);
+	spin_unlock(&jh->b_state_lock);
 	spin_unlock(&journal->j_list_lock);
-	__brelse(bh);
 	if (drop)
 		jbd2_journal_put_journal_head(jh);
 }
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 69c21a3843af..4180c3ef0a68 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1252,6 +1252,7 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
 					 int nr)
 {
 	struct ocfs2_group_desc *bg = (struct ocfs2_group_desc *) bg_bh->b_data;
+	struct journal_head *jh;
 	int ret;
 
 	if (ocfs2_test_bit(nr, (unsigned long *)bg->bg_bitmap))
@@ -1260,13 +1261,14 @@ static int ocfs2_test_bg_bit_allocatable(struct buffer_head *bg_bh,
 	if (!buffer_jbd(bg_bh))
 		return 1;
 
-	jbd_lock_bh_state(bg_bh);
-	bg = (struct ocfs2_group_desc *) bh2jh(bg_bh)->b_committed_data;
+	jh = bh2jh(bg_bh);
+	spin_lock(&jh->b_state_lock);
+	bg = (struct ocfs2_group_desc *) jh->b_committed_data;
 	if (bg)
 		ret = !ocfs2_test_bit(nr, (unsigned long *)bg->bg_bitmap);
 	else
 		ret = 1;
-	jbd_unlock_bh_state(bg_bh);
+	spin_unlock(&jh->b_state_lock);
 
 	return ret;
 }
@@ -2387,6 +2389,7 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
 	int status;
 	unsigned int tmp;
 	struct ocfs2_group_desc *undo_bg = NULL;
+	struct journal_head *jh;
 
 	/* The caller got this descriptor from
 	 * ocfs2_read_group_descriptor().  Any corruption is a code bug. */
@@ -2405,10 +2408,10 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
 		goto bail;
 	}
 
+	jh = bh2jh(group_bh);
 	if (undo_fn) {
-		jbd_lock_bh_state(group_bh);
-		undo_bg = (struct ocfs2_group_desc *)
-					bh2jh(group_bh)->b_committed_data;
+		spin_lock(&jh->b_state_lock);
+		undo_bg = (struct ocfs2_group_desc *) jh->b_committed_data;
 		BUG_ON(!undo_bg);
 	}
 
@@ -2423,7 +2426,7 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
 	le16_add_cpu(&bg->bg_free_bits_count, num_bits);
 	if (le16_to_cpu(bg->bg_free_bits_count) > le16_to_cpu(bg->bg_bits)) {
 		if (undo_fn)
-			jbd_unlock_bh_state(group_bh);
+			spin_unlock(&jh->b_state_lock);
 		return ocfs2_error(alloc_inode->i_sb, "Group descriptor # %llu has bit count %u but claims %u are freed. num_bits %d\n",
 				   (unsigned long long)le64_to_cpu(bg->bg_blkno),
 				   le16_to_cpu(bg->bg_bits),
@@ -2432,7 +2435,7 @@ static int ocfs2_block_group_clear_bits(handle_t *handle,
 	}
 
 	if (undo_fn)
-		jbd_unlock_bh_state(group_bh);
+		spin_unlock(&jh->b_state_lock);
 
 	ocfs2_journal_dirty(handle, group_bh);
 bail:
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 6aa66c42599d..00d6a69f4e74 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -313,7 +313,6 @@ enum jbd_state_bits {
 	BH_Revoked,		/* Has been revoked from the log */
 	BH_RevokeValid,		/* Revoked flag is valid */
 	BH_JBDDirty,		/* Is dirty but journaled */
-	BH_State,		/* Pins most journal_head state */
 	BH_JournalHead,		/* Pins bh->b_private and jh->b_bh */
 	BH_Shadow,		/* IO on shadow buffer is running */
 	BH_Verified,		/* Metadata block has been verified ok */
@@ -342,21 +341,6 @@ static inline struct journal_head *bh2jh(struct buffer_head *bh)
 	return bh->b_private;
 }
 
-static inline void jbd_lock_bh_state(struct buffer_head *bh)
-{
-	bit_spin_lock(BH_State, &bh->b_state);
-}
-
-static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
-{
-	return bit_spin_is_locked(BH_State, &bh->b_state);
-}
-
-static inline void jbd_unlock_bh_state(struct buffer_head *bh)
-{
-	bit_spin_unlock(BH_State, &bh->b_state);
-}
-
 static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
 {
 	bit_spin_lock(BH_JournalHead, &bh->b_state);
@@ -551,9 +535,9 @@ struct transaction_chp_stats_s {
  *      ->jbd_lock_bh_journal_head()	(This is "innermost")
  *
  *    j_state_lock
- *    ->jbd_lock_bh_state()
+ *    ->b_state_lock
  *
- *    jbd_lock_bh_state()
+ *    b_state_lock
  *    ->j_list_lock
  *
  *    j_state_lock
diff --git a/include/linux/journal-head.h b/include/linux/journal-head.h
index 9fb870524314..75bc56109031 100644
--- a/include/linux/journal-head.h
+++ b/include/linux/journal-head.h
@@ -11,6 +11,8 @@
 #ifndef JOURNAL_HEAD_H_INCLUDED
 #define JOURNAL_HEAD_H_INCLUDED
 
+#include <linux/spinlock.h>
+
 typedef unsigned int		tid_t;		/* Unique transaction ID */
 typedef struct transaction_s	transaction_t;	/* Compound transaction type */
 
@@ -23,6 +25,11 @@ struct journal_head {
 	 */
 	struct buffer_head *b_bh;
 
+	/*
+	 * Protect the buffer head state
+	 */
+	spinlock_t b_state_lock;
+
 	/*
 	 * Reference count - see description in journal.c
 	 * [jbd_lock_bh_journal_head()]
@@ -30,7 +37,7 @@ struct journal_head {
 	int b_jcount;
 
 	/*
-	 * Journalling list for this buffer [jbd_lock_bh_state()]
+	 * Journalling list for this buffer [b_state_lock]
 	 * NOTE: We *cannot* combine this with b_modified into a bitfield
 	 * as gcc would then (which the C standard allows but which is
 	 * very unuseful) make 64-bit accesses to the bitfield and clobber
@@ -41,20 +48,20 @@ struct journal_head {
 	/*
 	 * This flag signals the buffer has been modified by
 	 * the currently running transaction
-	 * [jbd_lock_bh_state()]
+	 * [b_state_lock]
 	 */
 	unsigned b_modified;
 
 	/*
 	 * Copy of the buffer data frozen for writing to the log.
-	 * [jbd_lock_bh_state()]
+	 * [b_state_lock]
 	 */
 	char *b_frozen_data;
 
 	/*
 	 * Pointer to a saved copy of the buffer containing no uncommitted
 	 * deallocation references, so that allocations can avoid overwriting
-	 * uncommitted deletes. [jbd_lock_bh_state()]
+	 * uncommitted deletes. [b_state_lock]
 	 */
 	char *b_committed_data;
 
@@ -63,7 +70,7 @@ struct journal_head {
 	 * metadata: either the running transaction or the committing
 	 * transaction (if there is one).  Only applies to buffers on a
 	 * transaction's data or metadata journaling list.
-	 * [j_list_lock] [jbd_lock_bh_state()]
+	 * [j_list_lock] [b_state_lock]
 	 * Either of these locks is enough for reading, both are needed for
 	 * changes.
 	 */
@@ -73,13 +80,13 @@ struct journal_head {
 	 * Pointer to the running compound transaction which is currently
 	 * modifying the buffer's metadata, if there was already a transaction
 	 * committing it when the new transaction touched it.
-	 * [t_list_lock] [jbd_lock_bh_state()]
+	 * [t_list_lock] [b_state_lock]
 	 */
 	transaction_t *b_next_transaction;
 
 	/*
 	 * Doubly-linked list of buffers on a transaction's data, metadata or
-	 * forget queue. [t_list_lock] [jbd_lock_bh_state()]
+	 * forget queue. [t_list_lock] [b_state_lock]
 	 */
 	struct journal_head *b_tnext, *b_tprev;
 
-- 
2.16.4



* [PATCH 7/7] jbd2: Free journal head outside of locked region
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (5 preceding siblings ...)
  2019-08-09 12:42 ` [PATCH 6/7] jbd2: Make state lock a spinlock Jan Kara
@ 2019-08-09 12:42 ` Jan Kara
  2019-10-11 12:31 ` [PATCH 0/7 v2] jbd2: Bit spinlock conversions Sebastian Andrzej Siewior
  2019-11-03 19:01 ` Theodore Y. Ts'o
  8 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-09 12:42 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara,
	Jan Kara

From: Thomas Gleixner <tglx@linutronix.de>

On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n,
i.e. they disable preemption. That means functions which are not safe to be
called in preempt disabled context on RT trigger a might_sleep() assert.

The journal head bit spinlock is mostly held for short code sequences with
trivial RT safe functionality, except for one place:

jbd2_journal_put_journal_head() invokes __journal_remove_journal_head()
with the journal head bit spinlock held. __journal_remove_journal_head()
invokes kmem_cache_free() which must not be called with preemption disabled
on RT.

Jan suggested reworking the removal function so that the actual free happens
outside the bit-spinlocked region.

Split it into two parts:

  - Do the sanity checks and the buffer head detach under the lock

  - Do the actual free after dropping the lock

There is error case handling in the free part which needs to dereference
the b_size field of the now detached buffer head. Due to paranoia (caused
by ignorance) the size is retrieved in the detach function and handed into
the free function. Might be over-engineered, but better safe than sorry.

This makes the journal head bit-spinlock usage RT compliant and also avoids
nested locking which is not covered by lockdep.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-ext4@vger.kernel.org
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/journal.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index eee350fef339..c79e6de2fcfc 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2533,17 +2533,23 @@ static void __journal_remove_journal_head(struct buffer_head *bh)
 	J_ASSERT_BH(bh, buffer_jbd(bh));
 	J_ASSERT_BH(bh, jh2bh(jh) == bh);
 	BUFFER_TRACE(bh, "remove journal_head");
+
+	/* Unlink before dropping the lock */
+	bh->b_private = NULL;
+	jh->b_bh = NULL;	/* debug, really */
+	clear_buffer_jbd(bh);
+}
+
+static void journal_release_journal_head(struct journal_head *jh, size_t b_size)
+{
 	if (jh->b_frozen_data) {
 		printk(KERN_WARNING "%s: freeing b_frozen_data\n", __func__);
-		jbd2_free(jh->b_frozen_data, bh->b_size);
+		jbd2_free(jh->b_frozen_data, b_size);
 	}
 	if (jh->b_committed_data) {
 		printk(KERN_WARNING "%s: freeing b_committed_data\n", __func__);
-		jbd2_free(jh->b_committed_data, bh->b_size);
+		jbd2_free(jh->b_committed_data, b_size);
 	}
-	bh->b_private = NULL;
-	jh->b_bh = NULL;	/* debug, really */
-	clear_buffer_jbd(bh);
 	journal_free_journal_head(jh);
 }
 
@@ -2561,9 +2567,11 @@ void jbd2_journal_put_journal_head(struct journal_head *jh)
 	if (!jh->b_jcount) {
 		__journal_remove_journal_head(bh);
 		jbd_unlock_bh_journal_head(bh);
+		journal_release_journal_head(jh, bh->b_size);
 		__brelse(bh);
-	} else
+	} else {
 		jbd_unlock_bh_journal_head(bh);
+	}
 }
 
 /*
-- 
2.16.4



* Re: [PATCH 0/7 v2] jbd2: Bit spinlock conversions
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (6 preceding siblings ...)
  2019-08-09 12:42 ` [PATCH 7/7] jbd2: Free journal head outside of locked region Jan Kara
@ 2019-10-11 12:31 ` Sebastian Andrzej Siewior
  2019-11-03 19:01 ` Theodore Y. Ts'o
  8 siblings, 0 replies; 14+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-10-11 12:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ted Tso, linux-ext4, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Anna-Maria Gleixner,
	Julia Cartwright

On 2019-08-09 14:42:26 [+0200], Jan Kara wrote:
> Hello,
Hi,

> This series is derived from Thomas' series to get rid of bit spinlocks in
> buffer head code. These patches convert BH_State bit spinlock to an ordinary
> spinlock inside struct journal_head and somewhat reduce the critical section
> under BH_JournalHead bit spinlock so that it is fine for RT. 
> 
> Motivation from original Thomas' series:
> 
> Bit spinlocks are problematic if PREEMPT_RT is enabled. They disable
> preemption, which is undesired for latency reasons and breaks when regular
> spinlocks are taken within the bit_spinlock locked region because regular
> spinlocks are converted to 'sleeping spinlocks' on RT.
> 
> Bit spinlocks are also not covered by lock debugging, e.g. lockdep. With
> the spinlock substitution in place, they can be exposed via a new config
> switch: CONFIG_DEBUG_BIT_SPINLOCKS.
> 
> Ted, can you pick up these patches? Thanks!

Has this series been postponed?

> Changes since v1:
> * Fixed up compilation breakage on UP due to missing linux/spinlock.h include
> 
> 								Honza

Sebastian


* Re: [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily
  2019-08-09 12:42 ` [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily Jan Kara
@ 2019-10-28 15:28   ` Theodore Y. Ts'o
  2019-10-28 16:01     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-28 15:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright

On Fri, Aug 09, 2019 at 02:42:31PM +0200, Jan Kara wrote:
> @@ -1660,10 +1660,9 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>  		__jbd2_journal_file_buffer(jh, transaction, BJ_Forget);
>  		spin_unlock(&journal->j_list_lock);
>  	}
> -
> +drop:
>  	jbd_unlock_bh_state(bh);
>  	__brelse(bh);
> -drop:
>  	if (drop_reserve) {
>  		/* no need to reserve log space for this block -bzzz */
>  		handle->h_buffer_credits++;

(Workflow observation: this is an example where Gerrit can be *so*
*much* *better* than e-mail review at times; sometimes, you *really*
want to see more context than what e-mail affords you.)

After this patch, the resulting code looks like this:

-----
drop:
	jbd_unlock_bh_state(bh);
	__brelse(bh);
	if (drop_reserve) {
		/* no need to reserve log space for this block -bzzz */
		handle->h_buffer_credits++;
	}
	return err;

not_jbd:
	jbd_unlock_bh_state(bh);
	__bforget(bh);
	goto drop;
----

And we still have a case where we jump to not_jbd, at which point hilarity
will ensue.

This is cleaned up in the following patch in this sequence, but this
leaves us in a not-great state if we are ever bisecting.

					- Ted


* Re: [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily
  2019-10-28 15:28   ` Theodore Y. Ts'o
@ 2019-10-28 16:01     ` Theodore Y. Ts'o
  2019-10-30 11:49       ` Jan Kara
  0 siblings, 1 reply; 14+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-28 16:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright

On Mon, Oct 28, 2019 at 11:28:08AM -0400, Theodore Y. Ts'o wrote:
> drop:
> 	jbd_unlock_bh_state(bh);
> 	__brelse(bh);
> 	if (drop_reserve) {
> 		/* no need to reserve log space for this block -bzzz */
> 		handle->h_buffer_credits++;
> 	}
> 	return err;
> 
> not_jbd:
> 	jbd_unlock_bh_state(bh);
> 	__bforget(bh);
> 	goto drop;
> ----
> 
> And we still have a case we jump to not_jbd, at which point hilarity
> will ensue.
> 
> This is cleaned up in the following patch in this sequence, but this
> leaves us in a not-great state if we are ever bisecting.

Proposed fixup: I'll apply the following on top of this commit, and
then fix the merge conflicts in 6/7 so that the end result looks the
same as before.

Jan, any objections?  I figure this way it'll save you from resending
the patch series, since the rest of it looks fine to me.

							- Ted


diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f2af4afc690a..c7c9a42451c7 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1541,8 +1541,11 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 
 	jbd_lock_bh_state(bh);
 
-	if (!buffer_jbd(bh))
-		goto not_jbd;
+	if (!buffer_jbd(bh)) {
+		jbd_unlock_bh_state(bh);
+		__bforget(bh);
+		return 0;
+	}
 	jh = bh2jh(bh);
 
 	/* Critical error: attempting to delete a bitmap buffer, maybe?
@@ -1671,11 +1674,6 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		handle->h_buffer_credits++;
 	}
 	return err;
-
-not_jbd:
-	jbd_unlock_bh_state(bh);
-	__bforget(bh);
-	goto drop;
 }
 
 /**


* Re: [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily
  2019-10-28 16:01     ` Theodore Y. Ts'o
@ 2019-10-30 11:49       ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-10-30 11:49 UTC (permalink / raw)
  To: Theodore Y. Ts'o
  Cc: Jan Kara, linux-ext4, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Anna-Maria Gleixner,
	Julia Cartwright

On Mon 28-10-19 12:01:36, Theodore Y. Ts'o wrote:
> On Mon, Oct 28, 2019 at 11:28:08AM -0400, Theodore Y. Ts'o wrote:
> > drop:
> > 	jbd_unlock_bh_state(bh);
> > 	__brelse(bh);
> > 	if (drop_reserve) {
> > 		/* no need to reserve log space for this block -bzzz */
> > 		handle->h_buffer_credits++;
> > 	}
> > 	return err;
> > 
> > not_jbd:
> > 	jbd_unlock_bh_state(bh);
> > 	__bforget(bh);
> > 	goto drop;
> > ----
> > 
> > And we still have a case we jump to not_jbd, at which point hilarity
> > will ensue.
> > 
> > This is cleaned up in the following patch in this sequence, but this
> > leaves us in a not-great state if we are ever bisecting.
> 
> Proposed fixup: I'll apply the following on top of this commit, and
> then fix the merge conflicts in 6/7 so that the end result looks the
> same as before.
> 
> Jan, any objections?  I figure this way it'll save you from resending
> the patch series, since the rest of it looks fine to me.

Yeah, looks good to me. Thanks for the fixup!

								Honza

> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index f2af4afc690a..c7c9a42451c7 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1541,8 +1541,11 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>  
>  	jbd_lock_bh_state(bh);
>  
> -	if (!buffer_jbd(bh))
> -		goto not_jbd;
> +	if (!buffer_jbd(bh)) {
> +		jbd_unlock_bh_state(bh);
> +		__bforget(bh);
> +		return 0;
> +	}
>  	jh = bh2jh(bh);
>  
>  	/* Critical error: attempting to delete a bitmap buffer, maybe?
> @@ -1671,11 +1674,6 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>  		handle->h_buffer_credits++;
>  	}
>  	return err;
> -
> -not_jbd:
> -	jbd_unlock_bh_state(bh);
> -	__bforget(bh);
> -	goto drop;
>  }
>  
>  /**
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 0/7 v2] jbd2: Bit spinlock conversions
  2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
                   ` (7 preceding siblings ...)
  2019-10-11 12:31 ` [PATCH 0/7 v2] jbd2: Bit spinlock conversions Sebastian Andrzej Siewior
@ 2019-11-03 19:01 ` Theodore Y. Ts'o
  8 siblings, 0 replies; 14+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-03 19:01 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright

On Fri, Aug 09, 2019 at 02:42:26PM +0200, Jan Kara wrote:
> Hello,
> 
> This series is derived from Thomas' series to get rid of bit spinlocks in
> buffer head code. These patches convert BH_State bit spinlock to an ordinary
> spinlock inside struct journal_head and somewhat reduce the critical section
> under BH_JournalHead bit spinlock so that it is fine for RT. 
> 
> Motivation from original Thomas' series:
> 
> Bit spinlocks are problematic if PREEMPT_RT is enabled. They disable
> preemption, which is undesired for latency reasons and breaks when regular
> spinlocks are taken within the bit_spinlock locked region because regular
> spinlocks are converted to 'sleeping spinlocks' on RT.
> 
> Bit spinlocks are also not covered by lock debugging, e.g. lockdep. With
> the spinlock substitution in place, they can be exposed via a new config
> switch: CONFIG_DEBUG_BIT_SPINLOCKS.
> 
> Ted, can you pick up these patches? Thanks!

I've landed this patch set on the ext4 git tree.  Thanks!

     	    	       	      	       - Ted


* [PATCH 7/7] jbd2: Free journal head outside of locked region
  2019-08-02 15:13 [PATCH 0/7] " Jan Kara
@ 2019-08-02 15:13 ` Jan Kara
  0 siblings, 0 replies; 14+ messages in thread
From: Jan Kara @ 2019-08-02 15:13 UTC (permalink / raw)
  To: Ted Tso
  Cc: linux-ext4, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Anna-Maria Gleixner, Julia Cartwright, Jan Kara,
	Jan Kara

From: Thomas Gleixner <tglx@linutronix.de>

On PREEMPT_RT bit-spinlocks have the same semantics as on PREEMPT_RT=n,
i.e. they disable preemption. That means functions which are not safe to be
called in preempt disabled context on RT trigger a might_sleep() assert.

The journal head bit spinlock is mostly held for short code sequences with
trivial RT safe functionality, except for one place:

jbd2_journal_put_journal_head() invokes __journal_remove_journal_head()
with the journal head bit spinlock held. __journal_remove_journal_head()
invokes kmem_cache_free() which must not be called with preemption disabled
on RT.

Jan suggested reworking the removal function so that the actual free happens
outside the bit-spinlocked region.

Split it into two parts:

  - Do the sanity checks and the buffer head detach under the lock

  - Do the actual free after dropping the lock

There is error case handling in the free part which needs to dereference
the b_size field of the now detached buffer head. Due to paranoia (caused
by ignorance) the size is retrieved in the detach function and handed into
the free function. Might be over-engineered, but better safe than sorry.

This makes the journal head bit-spinlock usage RT compliant and also avoids
nested locking which is not covered by lockdep.

Suggested-by: Jan Kara <jack@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-ext4@vger.kernel.org
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/journal.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index eee350fef339..c79e6de2fcfc 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2533,17 +2533,23 @@ static void __journal_remove_journal_head(struct buffer_head *bh)
 	J_ASSERT_BH(bh, buffer_jbd(bh));
 	J_ASSERT_BH(bh, jh2bh(jh) == bh);
 	BUFFER_TRACE(bh, "remove journal_head");
+
+	/* Unlink before dropping the lock */
+	bh->b_private = NULL;
+	jh->b_bh = NULL;	/* debug, really */
+	clear_buffer_jbd(bh);
+}
+
+static void journal_release_journal_head(struct journal_head *jh, size_t b_size)
+{
 	if (jh->b_frozen_data) {
 		printk(KERN_WARNING "%s: freeing b_frozen_data\n", __func__);
-		jbd2_free(jh->b_frozen_data, bh->b_size);
+		jbd2_free(jh->b_frozen_data, b_size);
 	}
 	if (jh->b_committed_data) {
 		printk(KERN_WARNING "%s: freeing b_committed_data\n", __func__);
-		jbd2_free(jh->b_committed_data, bh->b_size);
+		jbd2_free(jh->b_committed_data, b_size);
 	}
-	bh->b_private = NULL;
-	jh->b_bh = NULL;	/* debug, really */
-	clear_buffer_jbd(bh);
 	journal_free_journal_head(jh);
 }
 
@@ -2561,9 +2567,11 @@ void jbd2_journal_put_journal_head(struct journal_head *jh)
 	if (!jh->b_jcount) {
 		__journal_remove_journal_head(bh);
 		jbd_unlock_bh_journal_head(bh);
+		journal_release_journal_head(jh, bh->b_size);
 		__brelse(bh);
-	} else
+	} else {
 		jbd_unlock_bh_journal_head(bh);
+	}
 }
 
 /*
-- 
2.16.4



Thread overview: 14+ messages
2019-08-09 12:42 [PATCH 0/7 v2] jbd2: Bit spinlock conversions Jan Kara
2019-08-09 12:42 ` [PATCH 1/7] jbd2: Simplify journal_unmap_buffer() Jan Kara
2019-08-09 12:42 ` [PATCH 2/7] jbd2: Remove jbd_trylock_bh_state() Jan Kara
2019-08-09 12:42 ` [PATCH 3/7] jbd2: Move dropping of jh reference out of un/re-filing functions Jan Kara
2019-08-09 12:42 ` [PATCH 4/7] jbd2: Drop unnecessary branch from jbd2_journal_forget() Jan Kara
2019-08-09 12:42 ` [PATCH 5/7] jbd2: Don't call __bforget() unnecessarily Jan Kara
2019-10-28 15:28   ` Theodore Y. Ts'o
2019-10-28 16:01     ` Theodore Y. Ts'o
2019-10-30 11:49       ` Jan Kara
2019-08-09 12:42 ` [PATCH 6/7] jbd2: Make state lock a spinlock Jan Kara
2019-08-09 12:42 ` [PATCH 7/7] jbd2: Free journal head outside of locked region Jan Kara
2019-10-11 12:31 ` [PATCH 0/7 v2] jbd2: Bit spinlock conversions Sebastian Andrzej Siewior
2019-11-03 19:01 ` Theodore Y. Ts'o
  -- strict thread matches above, loose matches on Subject: below --
2019-08-02 15:13 [PATCH 0/7] " Jan Kara
2019-08-02 15:13 ` [PATCH 7/7] jbd2: Free journal head outside of locked region Jan Kara
