All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled
@ 2015-04-02 13:58 Jan Kara
  2015-04-02 13:58 ` [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access() Jan Kara
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

  Hello,

  this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
is already part of an appropriate journalling list. First three patches
are independent small cleanups so they can go in right away I think.

The other two patches *should* improve the situation for frequent bitmap
or inode table block updates. But frankly, I haven't been able to come up
with a load where I'd see significant contention on update of a single buffer
(or it's hidden by a larger lock). Similarly we could see improvements when
do_get_write_access() would be waiting for buffer lock because buffer is
being written out by checkpointing code. But again I wasn't able to hit this
reliably.

Ted, you mentioned at Vault you had a setup where frequent
do_get_write_access() calls were contending in the revoke code. What was the
load exactly? These patches should improve that as well...

								Honza

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access()
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
@ 2015-04-02 13:58 ` Jan Kara
  2015-06-08 16:39   ` Theodore Ts'o
  2015-04-02 13:58 ` [PATCH 2/5] jbd2: Simplify error path on allocation failure " Jan Kara
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

needs_copy is set only in one place in do_get_write_access(), just move
the frozen buffer copying into that place and factor it out to a
separate function to make do_get_write_access() slightly more readable.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5f09370c90a8..2b75f0f109be 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -763,6 +763,30 @@ static void warn_dirty_buffer(struct buffer_head *bh)
 	       bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
 }
 
+/* Call t_frozen trigger and copy buffer data into jh->b_frozen_data. */
+static void jbd2_freeze_jh_data(struct journal_head *jh)
+{
+	struct page *page;
+	int offset;
+	char *source;
+	struct buffer_head *bh = jh2bh(jh);
+
+	J_EXPECT_JH(jh, buffer_uptodate(bh), "Possible IO failure.\n");
+	page = bh->b_page;
+	offset = offset_in_page(bh->b_data);
+	source = kmap_atomic(page);
+	/* Fire data frozen trigger just before we copy the data */
+	jbd2_buffer_frozen_trigger(jh, source + offset, jh->b_triggers);
+	memcpy(jh->b_frozen_data, source + offset, bh->b_size);
+	kunmap_atomic(source);
+
+	/*
+	 * Now that the frozen data is saved off, we need to store any matching
+	 * triggers.
+	 */
+	jh->b_frozen_triggers = jh->b_triggers;
+}
+
 /*
  * If the buffer is already part of the current transaction, then there
  * is nothing we need to do.  If it is already part of a prior
@@ -782,7 +806,6 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 	journal_t *journal;
 	int error;
 	char *frozen_buffer = NULL;
-	int need_copy = 0;
 	unsigned long start_lock, time_lock;
 
 	WARN_ON(!transaction);
@@ -940,7 +963,7 @@ repeat:
 			}
 			jh->b_frozen_data = frozen_buffer;
 			frozen_buffer = NULL;
-			need_copy = 1;
+			jbd2_freeze_jh_data(jh);
 		}
 		jh->b_next_transaction = transaction;
 	}
@@ -961,28 +984,6 @@ repeat:
 	}
 
 done:
-	if (need_copy) {
-		struct page *page;
-		int offset;
-		char *source;
-
-		J_EXPECT_JH(jh, buffer_uptodate(jh2bh(jh)),
-			    "Possible IO failure.\n");
-		page = jh2bh(jh)->b_page;
-		offset = offset_in_page(jh2bh(jh)->b_data);
-		source = kmap_atomic(page);
-		/* Fire data frozen trigger just before we copy the data */
-		jbd2_buffer_frozen_trigger(jh, source + offset,
-					   jh->b_triggers);
-		memcpy(jh->b_frozen_data, source+offset, jh2bh(jh)->b_size);
-		kunmap_atomic(source);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/5] jbd2: Simplify error path on allocation failure in do_get_write_access()
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
  2015-04-02 13:58 ` [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access() Jan Kara
@ 2015-04-02 13:58 ` Jan Kara
  2015-06-08 16:42   ` Theodore Ts'o
  2015-04-02 13:58 ` [PATCH 3/5] jbd2: Simplify code flow " Jan Kara
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

We were acquiring bh_state_lock when allocation of buffer failed in
do_get_write_access() only to be able to jump to a label that releases
the lock and does all other checks that don't make sense for this error
path. Just jump into the right label instead.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 2b75f0f109be..1995ea539e96 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -956,8 +956,7 @@ repeat:
 					       __func__);
 					JBUFFER_TRACE(jh, "oom!");
 					error = -ENOMEM;
-					jbd_lock_bh_state(bh);
-					goto done;
+					goto out;
 				}
 				goto repeat;
 			}
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 3/5] jbd2: Simplify code flow in do_get_write_access()
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
  2015-04-02 13:58 ` [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access() Jan Kara
  2015-04-02 13:58 ` [PATCH 2/5] jbd2: Simplify error path on allocation failure " Jan Kara
@ 2015-04-02 13:58 ` Jan Kara
  2015-06-08 16:45   ` Theodore Ts'o
  2015-04-02 13:58 ` [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access() Jan Kara
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Check for the simple case of unjournaled buffer first, handle it and
bail out. This allows us to remove one if and unindent the difficult case
by one tab. The result is easier to read.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 130 +++++++++++++++++++++++---------------------------
 1 file changed, 59 insertions(+), 71 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 1995ea539e96..5207825d1038 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -893,6 +893,20 @@ repeat:
        jh->b_modified = 0;
 
 	/*
+	 * If the buffer is not journaled right now, we need to make sure it
+	 * doesn't get written to disk before the caller actually commits the
+	 * new data
+	 */
+	if (!jh->b_transaction) {
+		JBUFFER_TRACE(jh, "no transaction");
+		J_ASSERT_JH(jh, !jh->b_next_transaction);
+		JBUFFER_TRACE(jh, "file as BJ_Reserved");
+		spin_lock(&journal->j_list_lock);
+		__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
+		spin_unlock(&journal->j_list_lock);
+		goto done;
+	}
+	/*
 	 * If there is already a copy-out version of this buffer, then we don't
 	 * need to make another one
 	 */
@@ -903,84 +917,58 @@ repeat:
 		goto done;
 	}
 
-	/* Is there data here we need to preserve? */
+	JBUFFER_TRACE(jh, "owned by older transaction");
+	J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
+	J_ASSERT_JH(jh, jh->b_transaction == journal->j_committing_transaction);
 
-	if (jh->b_transaction && jh->b_transaction != transaction) {
-		JBUFFER_TRACE(jh, "owned by older transaction");
-		J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
-		J_ASSERT_JH(jh, jh->b_transaction ==
-					journal->j_committing_transaction);
+	/*
+	 * There is one case we have to be very careful about.  If the
+	 * committing transaction is currently writing this buffer out to disk
+	 * and has NOT made a copy-out, then we cannot modify the buffer
+	 * contents at all right now.  The essence of copy-out is that it is
+	 * the extra copy, not the primary copy, which gets journaled.  If the
+	 * primary copy is already going to disk then we cannot do copy-out
+	 * here.
+	 */
+	if (buffer_shadow(bh)) {
+		JBUFFER_TRACE(jh, "on shadow: sleep");
+		jbd_unlock_bh_state(bh);
+		wait_on_bit_io(&bh->b_state, BH_Shadow, TASK_UNINTERRUPTIBLE);
+		goto repeat;
+	}
 
-		/* There is one case we have to be very careful about.
-		 * If the committing transaction is currently writing
-		 * this buffer out to disk and has NOT made a copy-out,
-		 * then we cannot modify the buffer contents at all
-		 * right now.  The essence of copy-out is that it is the
-		 * extra copy, not the primary copy, which gets
-		 * journaled.  If the primary copy is already going to
-		 * disk then we cannot do copy-out here. */
-
-		if (buffer_shadow(bh)) {
-			JBUFFER_TRACE(jh, "on shadow: sleep");
+	/*
+	 * Only do the copy if the currently-owning transaction still needs it.
+	 * If buffer isn't on BJ_Metadata list, the committing transaction is
+	 * past that stage (here we use the fact that BH_Shadow is set under
+	 * bh_state lock together with refiling to BJ_Shadow list and at this
+	 * point we know the buffer doesn't have BH_Shadow set).
+	 *
+	 * Subtle point, though: if this is a get_undo_access, then we will be
+	 * relying on the frozen_data to contain the new value of the
+	 * committed_data record after the transaction, so we HAVE to force the
+	 * frozen_data copy in that case.
+	 */
+	if (jh->b_jlist == BJ_Metadata || force_copy) {
+		JBUFFER_TRACE(jh, "generate frozen data");
+		if (!frozen_buffer) {
+			JBUFFER_TRACE(jh, "allocate memory for buffer");
 			jbd_unlock_bh_state(bh);
-			wait_on_bit_io(&bh->b_state, BH_Shadow,
-				       TASK_UNINTERRUPTIBLE);
-			goto repeat;
-		}
-
-		/*
-		 * Only do the copy if the currently-owning transaction still
-		 * needs it. If buffer isn't on BJ_Metadata list, the
-		 * committing transaction is past that stage (here we use the
-		 * fact that BH_Shadow is set under bh_state lock together with
-		 * refiling to BJ_Shadow list and at this point we know the
-		 * buffer doesn't have BH_Shadow set).
-		 *
-		 * Subtle point, though: if this is a get_undo_access,
-		 * then we will be relying on the frozen_data to contain
-		 * the new value of the committed_data record after the
-		 * transaction, so we HAVE to force the frozen_data copy
-		 * in that case.
-		 */
-		if (jh->b_jlist == BJ_Metadata || force_copy) {
-			JBUFFER_TRACE(jh, "generate frozen data");
+			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS);
 			if (!frozen_buffer) {
-				JBUFFER_TRACE(jh, "allocate memory for buffer");
-				jbd_unlock_bh_state(bh);
-				frozen_buffer =
-					jbd2_alloc(jh2bh(jh)->b_size,
-							 GFP_NOFS);
-				if (!frozen_buffer) {
-					printk(KERN_ERR
-					       "%s: OOM for frozen_buffer\n",
-					       __func__);
-					JBUFFER_TRACE(jh, "oom!");
-					error = -ENOMEM;
-					goto out;
-				}
-				goto repeat;
+				printk(KERN_ERR "%s: OOM for frozen_buffer\n",
+				       __func__);
+				JBUFFER_TRACE(jh, "oom!");
+				error = -ENOMEM;
+				goto out;
 			}
-			jh->b_frozen_data = frozen_buffer;
-			frozen_buffer = NULL;
-			jbd2_freeze_jh_data(jh);
+			goto repeat;
 		}
-		jh->b_next_transaction = transaction;
-	}
-

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
                   ` (2 preceding siblings ...)
  2015-04-02 13:58 ` [PATCH 3/5] jbd2: Simplify code flow " Jan Kara
@ 2015-04-02 13:58 ` Jan Kara
  2015-06-08 16:47   ` Theodore Ts'o
  2015-04-02 13:58 ` [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata() Jan Kara
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
frequently called for buffers that are already part of the running
transaction - most frequently it is the case for bitmaps, inode table
blocks, and superblock. Since in such cases we have nothing to do, it is
unfortunate we still grab reference to journal head, lock the bh, lock
bh_state only to find out there's nothing to do.

Improving this is a bit subtle though since until we find out journal
head is attached to the running transaction, it can disappear from under
us because checkpointing / commit decided it's no longer needed. We deal
with this by protecting journal_head slab with RCU. We still have to be
careful about journal head being freed & reallocated within slab and
about exposing journal head in consistent state (in particular
b_modified and b_frozen_data must be in correct state before we allow
user to touch the buffer).

FIXME: Performance data.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/journal.c     |  2 +-
 fs/jbd2/transaction.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 73 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b96bd8076b70..f29872ed4097 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2330,7 +2330,7 @@ static int jbd2_journal_init_journal_head_cache(void)
 	jbd2_journal_head_cache = kmem_cache_create("jbd2_journal_head",
 				sizeof(struct journal_head),
 				0,		/* offset */
-				SLAB_TEMPORARY,	/* flags */
+				SLAB_TEMPORARY | SLAB_DESTROY_BY_RCU,
 				NULL);		/* ctor */
 	retval = 0;
 	if (!jbd2_journal_head_cache) {
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 5207825d1038..a91f639af6c3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -901,6 +901,12 @@ repeat:
 		JBUFFER_TRACE(jh, "no transaction");
 		J_ASSERT_JH(jh, !jh->b_next_transaction);
 		JBUFFER_TRACE(jh, "file as BJ_Reserved");
+		/*
+		 * Make sure all stores to jh (b_modified, b_frozen_data) are
+		 * visible before attaching it to the running transaction.
+		 * Paired with barrier in jbd2_write_access_granted()
+		 */
+		smp_wmb();
 		spin_lock(&journal->j_list_lock);
 		__jbd2_journal_file_buffer(jh, transaction, BJ_Reserved);
 		spin_unlock(&journal->j_list_lock);
@@ -913,8 +919,7 @@ repeat:
 	if (jh->b_frozen_data) {
 		JBUFFER_TRACE(jh, "has frozen data");
 		J_ASSERT_JH(jh, jh->b_next_transaction == NULL);
-		jh->b_next_transaction = transaction;
-		goto done;
+		goto attach_next;
 	}
 
 	JBUFFER_TRACE(jh, "owned by older transaction");
@@ -968,6 +973,13 @@ repeat:
 		frozen_buffer = NULL;
 		jbd2_freeze_jh_data(jh);
 	}
+attach_next:
+	/*
+	 * Make sure all stores to jh (b_modified, b_frozen_data) are visible
+	 * before attaching it to the running transaction. Paired with barrier
+	 * in jbd2_write_access_granted()
+	 */
+	smp_wmb();
 	jh->b_next_transaction = transaction;
 
 done:
@@ -987,6 +999,55 @@ out:
 	return error;
 }
 
+/* Fast check whether buffer is already attached to the required transaction */
+static bool jbd2_write_access_granted(handle_t *handle, struct buffer_head *bh)
+{
+	struct journal_head *jh;
+	bool ret = false;
+
+	/* Dirty buffers require special handling... */
+	if (buffer_dirty(bh))
+		return false;
+
+	/*
+	 * RCU protects us from dereferencing freed pages. So the checks we do
+	 * are guaranteed not to oops. However the jh slab object can get freed
+	 * & reallocated while we work with it. So we have to be careful. When
+	 * we see jh attached to the running transaction, we know it must stay
+	 * so until the transaction is committed. Thus jh won't be freed and
+	 * will be attached to the same bh while we run.  However it can
+	 * happen jh gets freed, reallocated, and attached to the transaction
+	 * just after we get pointer to it from bh. So we have to be careful
+	 * and recheck jh still belongs to our bh before we return success.
+	 */
+	rcu_read_lock();
+	if (!buffer_jbd(bh))
+		goto out;
+	/* This should be bh2jh() but that doesn't work with inline functions */
+	jh = READ_ONCE(bh->b_private);
+	if (!jh)
+		goto out;
+	if (jh->b_transaction != handle->h_transaction &&
+	    jh->b_next_transaction != handle->h_transaction)
+		goto out;
+	/*
+	 * There are two reasons for the barrier here:
+	 * 1) Make sure to fetch b_bh after we did previous checks so that we
+	 * detect when jh went through free, realloc, attach to transaction
+	 * while we were checking. Paired with implicit barrier in that path.
+	 * 2) So that access to bh done after jbd2_write_access_granted()
+	 * doesn't get reordered and see inconsistent state of concurrent
+	 * do_get_write_access().
+	 */
+	smp_mb();
+	if (unlikely(jh->b_bh != bh))
+		goto out;
+	ret = true;
+out:
+	rcu_read_unlock();
+	return ret;
+}
+
 /**
  * int jbd2_journal_get_write_access() - notify intent to modify a buffer for metadata (not data) update.
  * @handle: transaction to add buffer modifications to
@@ -1000,9 +1061,13 @@ out:
 
 int jbd2_journal_get_write_access(handle_t *handle, struct buffer_head *bh)
 {
-	struct journal_head *jh = jbd2_journal_add_journal_head(bh);
+	struct journal_head *jh;
 	int rc;
 
+	if (jbd2_write_access_granted(handle, bh))
+		return 0;
+
+	jh = jbd2_journal_add_journal_head(bh);
 	/* We do not want to get caught playing with fields which the
 	 * log thread also manipulates.  Make sure that the buffer
 	 * completes any outstanding IO before proceeding. */
@@ -1133,11 +1198,14 @@ out:
 int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh)
 {
 	int err;
-	struct journal_head *jh = jbd2_journal_add_journal_head(bh);
+	struct journal_head *jh;
 	char *committed_data = NULL;
 
 	JBUFFER_TRACE(jh, "entry");
+	if (jbd2_write_access_granted(handle, bh))
+		return 0;
 
+	jh = jbd2_journal_add_journal_head(bh);
 	/*
 	 * Do this first --- it can drop the journal lock, so we want to
 	 * make sure that obtaining the committed_data is done
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
                   ` (3 preceding siblings ...)
  2015-04-02 13:58 ` [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access() Jan Kara
@ 2015-04-02 13:58 ` Jan Kara
  2015-06-08 16:50   ` Theodore Ts'o
  2015-04-02 14:23 ` [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Theodore Ts'o
  2015-04-12 10:09 ` Dmitry Monakhov
  6 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-04-02 13:58 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 22 ++++++++++++++++------
 1 file changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a91f639af6c3..ad10ca8fb9ef 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1290,8 +1290,6 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
 	triggers->t_abort(triggers, jh2bh(jh));
 }
 
-
-
 /**
  * int jbd2_journal_dirty_metadata() -  mark a buffer as containing dirty metadata
  * @handle: transaction to add buffer to.
@@ -1325,12 +1323,25 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	WARN_ON(!transaction);
 	if (is_handle_aborted(handle))
 		return -EROFS;
-	journal = transaction->t_journal;
-	jh = jbd2_journal_grab_journal_head(bh);
-	if (!jh) {
+	if (!buffer_jbd(bh)) {
 		ret = -EUCLEAN;
 		goto out;
 	}
+	/*
+	 * We don't grab jh reference here since the buffer must be part
+	 * of the running transaction.
+	 */
+	jh = bh2jh(bh);
+	J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+			jh->b_next_transaction == transaction);
+	if (jh->b_modified == 1) {
+		/* If it's in our transaction it must be in BJ_Metadata list */
+		J_ASSERT_JH(jh, jh->b_transaction != transaction ||
+				jh->b_jlist == BJ_Metadata);
+		goto out;
+	}
+
+	journal = transaction->t_journal;
 	jbd_debug(5, "journal_head %p\n", jh);
 	JBUFFER_TRACE(jh, "entry");
 
@@ -1421,7 +1432,6 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	spin_unlock(&journal->j_list_lock);
 out_unlock_bh:
 	jbd_unlock_bh_state(bh);
-	jbd2_journal_put_journal_head(jh);
 out:
 	JBUFFER_TRACE(jh, "exit");
 	return ret;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
                   ` (4 preceding siblings ...)
  2015-04-02 13:58 ` [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata() Jan Kara
@ 2015-04-02 14:23 ` Theodore Ts'o
  2015-04-12 10:09 ` Dmitry Monakhov
  6 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-04-02 14:23 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:15PM +0200, Jan Kara wrote:
> 
>   this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> is already part of an appropriate journalling list. First three patches
> are independent small cleanups so they can go in right away I think.
> 
> The other two patches *should* improve the situation for frequent bitmap
> or inode table block updates. But frankly, I haven't been able to come up
> with a load where I'd see significant contention on update of a single buffer
> (or it's hidden by a larger lock). Similarly we could see improvements when
> do_get_write_access() would be waiting for buffer lock because buffer is
> being written out by checkpointing code. But again I wasn't able to hit this
> reliably.
> 
> Ted, you mentioned at Vault you had a setup where frequent
> do_get_write_access() calls were contending in the revoke code. What was the
> load exactly? These patches should improve that as well...

Use a 32-core Intel processor with 128GB memory; create a 32GB ram
disk, but ext4 on it, and then run your favorite scalability workload
on it.  I used a random 4k write workload, and noted that we were
calling start_handle() all the time.  This was fixed in dioread_nolock
since we check to see if it's an overwrite.

I'll have to look at this again, but I remember thinking that we could
push the overwrite check down a level, and with a few other tweaks,
end up fixing the AIO race condition you were worrying about it, as
well as skipping the start_handle() call in the case where we know
we're doing an overwrite in all cases, not just dioread_nolock.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled
  2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
                   ` (5 preceding siblings ...)
  2015-04-02 14:23 ` [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Theodore Ts'o
@ 2015-04-12 10:09 ` Dmitry Monakhov
  2015-04-16 10:46   ` Jan Kara
  6 siblings, 1 reply; 20+ messages in thread
From: Dmitry Monakhov @ 2015-04-12 10:09 UTC (permalink / raw)
  To: Jan Kara, linux-ext4; +Cc: Ted Tso, Jan Kara


Jan Kara <jack@suse.cz> writes:

>   Hello,
>
>   this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> is already part of an appropriate journalling list. First three patches
> are independent small cleanups so they can go in right away I think.
>
> The other two patches *should* improve the situation for frequent bitmap
> or inode table block updates. But frankly, I haven't been able to come up
> with a load where I'd see significant contention on update of a single buffer
> (or it's hidden by a larger lock). Similarly we could see improvements when
> do_get_write_access() would be waiting for buffer lock because buffer is
> being written out by checkpointing code. But again I wasn't able to hit this
> reliably.
One of most annoying performance issues was unpredictable latency of aio submission
This is typical workload on chunk server (object storage, cloud storage,
ceph) where one some tasks performs  aio/dio submission and other
performs fsync(). Some times we got this io_submit->touch_mtime()->
do_get_write_access() observes that jh->b_jlist == BJ_Shadow and wait
for transaction commit. So aio-dio submission can block (even if file
was previously allocated) for a long time(1-5sec) on ext3/4
But this was fixed by 'lazytime' option

#Simplified testcase
#BAD workload which provoke endless fsync->commit_transaction
while true; do 
      xfs_io -c "pwrite -b 1M 1M 32M" \
	-f t{1,2,3,5,6,7,8,9,10};
      xfs_io -c "pwrite -b 1M 1M 1M" -c \
	"fsync" -f -d t11

# Measure aio-dio latency
[root@alice Z]# uname -a
Linux alice.qa.sw.ru 4.0.0-rc7+ #13 SMP Sun Apr 12 00:34:51 MSK 2015
x86_64 x86_64 x86_64 GNU/Linux
[root@alice Z]# ioping -A  -C -D  -WWW t
4.0 KiB from t (ext4 /dev/sdb1): request=1 time=441 us
...
4.0 KiB from t (ext4 /dev/sdb1): request=12 time=393 us
4.0 KiB from t (ext4 /dev/sdb1): request=13 time=2.7 s <---- too long
4.0 KiB from t (ext4 /dev/sdb1): request=14 time=397 us
4.0 KiB from t (ext4 /dev/sdb1): request=15 time=398 us
^C
--- t (ext4 /dev/sdb1) ioping statistics ---
15 requests completed in 17.2 s, 5 iops, 22.0 KiB/s
min/avg/max/mdev = 384 us / 182.2 ms / 2.7 s / 679.1 ms
>
> Ted, you mentioned at Vault you had a setup where frequent
> do_get_write_access() calls were contending in the revoke code. What was the
> load exactly? These patches should improve that as well...
>
> 								Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled
  2015-04-12 10:09 ` Dmitry Monakhov
@ 2015-04-16 10:46   ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2015-04-16 10:46 UTC (permalink / raw)
  To: Dmitry Monakhov; +Cc: Jan Kara, linux-ext4, Ted Tso

On Sun 12-04-15 14:09:14, Dmitry Monakhov wrote:
> 
> Jan Kara <jack@suse.cz> writes:
> 
> >   Hello,
> >
> >   this patch set improves do_get_write_access(), jbd2_journal_get_undo_access(),
> > and jbd2_journal_dirty_metadata() to be completely lockless in case buffer
> > is already part of an appropriate journalling list. First three patches
> > are independent small cleanups so they can go in right away I think.
> >
> > The other two patches *should* improve the situation for frequent bitmap
> > or inode table block updates. But frankly, I haven't been able to come up
> > with a load where I'd see significant contention on update of a single buffer
> > (or it's hidden by a larger lock). Similarly we could see improvements when
> > do_get_write_access() would be waiting for buffer lock because buffer is
> > being written out by checkpointing code. But again I wasn't able to hit this
> > reliably.
> One of most annoying performance issues was unpredictable latency of aio submission
> This is typical workload on chunk server (object storage, cloud storage,
> ceph) where one some tasks performs  aio/dio submission and other
> performs fsync(). Some times we got this io_submit->touch_mtime()->
> do_get_write_access() observes that jh->b_jlist == BJ_Shadow and wait
> for transaction commit. So aio-dio submission can block (even if file
> was previously allocated) for a long time(1-5sec) on ext3/4
> But this was fixed by 'lazytime' option
  Yeah, my patches don't really help this particular problem. They help the
case where buffer already is part of the running transaction but in your
case we need to move the buffer to the running transaction and that blocks
on buffer being under IO. There isn't much we can do to improve this (well,
we could unconditionally create a frozen buffer for journal IO which would
limit maximum latency but at the cost of memcpy for each journal write so
throughput would suffer in case of faster storage).

								Honza
> 
> #Simplified testcase
> #BAD workload which provoke endless fsync->commit_transaction
> while true; do 
>       xfs_io -c "pwrite -b 1M 1M 32M" \
> 	-f t{1,2,3,5,6,7,8,9,10};
>       xfs_io -c "pwrite -b 1M 1M 1M" -c \
> 	"fsync" -f -d t11
> 
> # Measure aio-dio latency
> [root@alice Z]# uname -a
> Linux alice.qa.sw.ru 4.0.0-rc7+ #13 SMP Sun Apr 12 00:34:51 MSK 2015
> x86_64 x86_64 x86_64 GNU/Linux
> [root@alice Z]# ioping -A  -C -D  -WWW t
> 4.0 KiB from t (ext4 /dev/sdb1): request=1 time=441 us
> ...
> 4.0 KiB from t (ext4 /dev/sdb1): request=12 time=393 us
> 4.0 KiB from t (ext4 /dev/sdb1): request=13 time=2.7 s <---- too long
> 4.0 KiB from t (ext4 /dev/sdb1): request=14 time=397 us
> 4.0 KiB from t (ext4 /dev/sdb1): request=15 time=398 us
> ^C
> --- t (ext4 /dev/sdb1) ioping statistics ---
> 15 requests completed in 17.2 s, 5 iops, 22.0 KiB/s
> min/avg/max/mdev = 384 us / 182.2 ms / 2.7 s / 679.1 ms
> >
> > Ted, you mentioned at Vault you had a setup where frequent
> > do_get_write_access() calls were contending in the revoke code. What was the
> > load exactly? These patches should improve that as well...
> >
> > 								Honza
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access()
  2015-04-02 13:58 ` [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access() Jan Kara
@ 2015-06-08 16:39   ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 16:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:16PM +0200, Jan Kara wrote:
> needs_copy is set only in one place in do_get_write_access(), just move
> the frozen buffer copying into that place and factor it out to a
> separate function to make do_get_write_access() slightly more readable.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/5] jbd2: Simplify error path on allocation failure in do_get_write_access()
  2015-04-02 13:58 ` [PATCH 2/5] jbd2: Simplify error path on allocation failure " Jan Kara
@ 2015-06-08 16:42   ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 16:42 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:17PM +0200, Jan Kara wrote:
> We were acquiring bh_state_lock when allocation of buffer failed in
> do_get_write_access() only to be able to jump to a label that releases
> the lock and does all other checks that don't make sense for this error
> path. Just jump into the right label instead.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/5] jbd2: Simplify code flow in do_get_write_access()
  2015-04-02 13:58 ` [PATCH 3/5] jbd2: Simplify code flow " Jan Kara
@ 2015-06-08 16:45   ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 16:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:18PM +0200, Jan Kara wrote:
> Check for the simple case of unjournaled buffer first, handle it and
> bail out. This allows us to remove one if and unindent the difficult case
> by one tab. The result is easier to read.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, although I renamed the summary so we didn't have to commits
with the identical one-line summary in close proximity to each other.

     	 	   	    	       	     - Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-04-02 13:58 ` [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access() Jan Kara
@ 2015-06-08 16:47   ` Theodore Ts'o
  2015-06-08 22:32     ` Theodore Ts'o
  2015-06-17 16:56     ` Jan Kara
  0 siblings, 2 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 16:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> frequently called for buffers that are already part of the running
> transaction - most frequently it is the case for bitmaps, inode table
> blocks, and superblock. Since in such cases we have nothing to do, it is
> unfortunate we still grab reference to journal head, lock the bh, lock
> bh_state only to find out there's nothing to do.
> 
> Improving this is a bit subtle though since until we find out journal
> head is attached to the running transaction, it can disappear from under
> us because checkpointing / commit decided it's no longer needed. We deal
> with this by protecting journal_head slab with RCU. We still have to be
> careful about journal head being freed & reallocated within slab and
> about exposing journal head in consistent state (in particular
> b_modified and b_frozen_data must be in correct state before we allow
> user to touch the buffer).
> 
> FIXME: Performance data.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, so we can start getting some testing on this patch.  Did you
ever get performance data?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()
  2015-04-02 13:58 ` [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata() Jan Kara
@ 2015-06-08 16:50   ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 16:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Thu, Apr 02, 2015 at 03:58:20PM +0200, Jan Kara wrote:
> It is often the case that we mark buffer as having dirty metadata when
> the buffer is already in that state (frequent for bitmaps, inode table
> blocks, superblock). Thus it is unnecessary to contend on grabbing
> journal head reference and bh_state lock. Avoid that by checking whether
> any modification to the buffer is needed before grabbing any locks or
> references.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Thanks, applied.

						- Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-06-08 16:47   ` Theodore Ts'o
@ 2015-06-08 22:32     ` Theodore Ts'o
  2015-06-09  5:24       ` Theodore Ts'o
  2015-06-17 16:56     ` Jan Kara
  1 sibling, 1 reply; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-08 22:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Jun 08, 2015 at 12:47:26PM -0400, Theodore Ts'o wrote:
> On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> > frequently called for buffers that are already part of the running
> > transaction - most frequently it is the case for bitmaps, inode table
> > blocks, and superblock. Since in such cases we have nothing to do, it is
> > unfortunate we still grab reference to journal head, lock the bh, lock
> > bh_state only to find out there's nothing to do.
> > 
> > Improving this is a bit subtle though since until we find out journal
> > head is attached to the running transaction, it can disappear from under
> > us because checkpointing / commit decided it's no longer needed. We deal
> > with this by protecting journal_head slab with RCU. We still have to be
> > careful about journal head being freed & reallocated within slab and
> > about exposing journal head in consistent state (in particular
> > b_modified and b_frozen_data must be in correct state before we allow
> > user to touch the buffer).
> > 
> > FIXME: Performance data.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Applied, so we can start getting some testing on this patch.  Did you
> ever get performance data?

.... and this patch is causing generic/011 to fail.

generic/011 2s ...  [18:26:52][   13.085375] run fstests generic/011 at 2015-06-08 18:26:52
[   13.698245] ------------[ cut here ]------------
[   13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
[   13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[   13.701388] Modules linked in:
[   13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
[   13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
[   13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
[   13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
[   13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
[   13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
[   13.701505]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[   13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
[   13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[   13.701505] DR6: fffe0ff0 DR7: 00000400
[   13.701505] Stack:
[   13.701505]  000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
[   13.701505]  c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
[   13.701505]  c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
[   13.701505] Call Trace:
[   13.701505]  [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
[   13.701505]  [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
[   13.701505]  [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
[   13.701505]  [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
[   13.701505]  [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
[   13.701505]  [<c02bd87d>] ext4_mknod+0x8b/0x11c
[   13.701505]  [<c025ffbe>] vfs_mknod+0x7e/0x9e
[   13.701505]  [<c0263a0a>] SyS_mknodat+0x119/0x15a
[   13.701505]  [<c0263a65>] SyS_mknod+0x1a/0x1c
[   13.701505]  [<c083292a>] syscall_call+0x7/0x7
[   13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
[   13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
[   13.734150] ---[ end trace eb359de3ec6c3af4 ]---

I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
now.  Could you take a look at this?

      	    	      	    	       	    - Ted

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-06-08 22:32     ` Theodore Ts'o
@ 2015-06-09  5:24       ` Theodore Ts'o
  2015-06-17 16:39         ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-09  5:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Jun 08, 2015 at 06:32:30PM -0400, Theodore Ts'o wrote:
> 
> .... and this patch is causing generic/011 to fail.
> 
> generic/011 2s ...  [18:26:52][   13.085375] run fstests generic/011 at 2015-06-08 18:26:52
> [   13.698245] ------------[ cut here ]------------
> [   13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
> [   13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [   13.701388] Modules linked in:
> [   13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
> [   13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [   13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
> [   13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
> [   13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
> [   13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
> [   13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
> [   13.701505]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> [   13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
> [   13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [   13.701505] DR6: fffe0ff0 DR7: 00000400
> [   13.701505] Stack:
> [   13.701505]  000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
> [   13.701505]  c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
> [   13.701505]  c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
> [   13.701505] Call Trace:
> [   13.701505]  [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
> [   13.701505]  [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
> [   13.701505]  [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
> [   13.701505]  [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
> [   13.701505]  [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
> [   13.701505]  [<c02bd87d>] ext4_mknod+0x8b/0x11c
> [   13.701505]  [<c025ffbe>] vfs_mknod+0x7e/0x9e
> [   13.701505]  [<c0263a0a>] SyS_mknodat+0x119/0x15a
> [   13.701505]  [<c0263a65>] SyS_mknod+0x1a/0x1c
> [   13.701505]  [<c083292a>] syscall_call+0x7/0x7
> [   13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
> [   13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
> [   13.734150] ---[ end trace eb359de3ec6c3af4 ]---
> 
> I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
> now.  Could you take a look at this?

It looks like the problem is really caused by the patch #5/5, not
patch #4/5.  So I'll drop patch #5 and do more in-depth testing with
the first four patches in the patch series.

					- Ted
					

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-06-09  5:24       ` Theodore Ts'o
@ 2015-06-17 16:39         ` Jan Kara
  0 siblings, 0 replies; 20+ messages in thread
From: Jan Kara @ 2015-06-17 16:39 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 3184 bytes --]

On Tue 09-06-15 01:24:50, Ted Tso wrote:
> On Mon, Jun 08, 2015 at 06:32:30PM -0400, Theodore Ts'o wrote:
> > 
> > .... and this patch is causing generic/011 to fail.
> > 
> > generic/011 2s ...  [18:26:52][   13.085375] run fstests generic/011 at 2015-06-08 18:26:52
> > [   13.698245] ------------[ cut here ]------------
> > [   13.699093] kernel BUG at /usr/projects/linux/ext4/fs/jbd2/transaction.c:1329!
> > [   13.700354] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> > [   13.701388] Modules linked in:
> > [   13.701505] CPU: 0 PID: 3947 Comm: dirstress Not tainted 4.1.0-rc4-00034-g562bef4 #2758
> > [   13.701505] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [   13.701505] task: ee1bc110 ti: ec080000 task.ti: ec080000
> > [   13.701505] EIP: 0060:[<c02eeb96>] EFLAGS: 00210206 CPU: 0
> > [   13.701505] EIP is at jbd2_journal_dirty_metadata+0x5e/0x1da
> > [   13.701505] EAX: 00000000 EBX: eccde090 ECX: f03fd580 EDX: f03fd580
> > [   13.701505] ESI: eee85860 EDI: ed918cc0 EBP: ec081e14 ESP: ec081e00
> > [   13.701505]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > [   13.701505] CR0: 8005003b CR2: b73fbb20 CR3: 2fd58700 CR4: 000006f0
> > [   13.701505] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> > [   13.701505] DR6: fffe0ff0 DR7: 00000400
> > [   13.701505] Stack:
> > [   13.701505]  000000f7 f03fd580 ed918cc0 eee85860 00000000 ec081e30 c02d7fc8 00001179
> > [   13.701505]  c0865c3c eae67c30 ee17f800 00000000 ec081e84 c02b6458 00000000 ed918cc0
> > [   13.701505]  c02ee941 00000000 00000000 00000000 eae67b20 00000000 00000000 eae67e78
> > [   13.701505] Call Trace:
> > [   13.701505]  [<c02d7fc8>] __ext4_handle_dirty_metadata+0xd4/0x19d
> > [   13.701505]  [<c02b6458>] ext4_mark_iloc_dirty+0x458/0x577
> > [   13.701505]  [<c02ee941>] ? jbd2_journal_get_write_access+0x3d/0x48
> > [   13.701505]  [<c02b66e4>] ext4_mark_inode_dirty+0x105/0x252
> > [   13.701505]  [<c02b121b>] __ext4_new_inode+0xcb6/0xe9b
> > [   13.701505]  [<c02bd87d>] ext4_mknod+0x8b/0x11c
> > [   13.701505]  [<c025ffbe>] vfs_mknod+0x7e/0x9e
> > [   13.701505]  [<c0263a0a>] SyS_mknodat+0x119/0x15a
> > [   13.701505]  [<c0263a65>] SyS_mknod+0x1a/0x1c
> > [   13.701505]  [<c083292a>] syscall_call+0x7/0x7
> > [   13.701505] Code: 00 00 8b 5f 24 8b 4b 18 39 d1 74 07 39 53 1c 74 02 0f 0b 83 7b 0c 01 75 14 39 d1 0f 85 7e 01 00 00 83 7b 08 01 0f 84 74 01 00 00 <0f> 0b 8b 02 53 68 16 2c ac c0 68 36 05 00 00 68 14 81 86 c0 68
> > [   13.701505] EIP: [<c02eeb96>] jbd2_journal_dirty_metadata+0x5e/0x1da SS:ESP 0068:ec081e00
> > [   13.734150] ---[ end trace eb359de3ec6c3af4 ]---
> > 
> > I will drop 4/5 and 5/5 from this patch series from the ext4 tree for
> > now.  Could you take a look at this?
> 
> It looks like the problem is really caused by the patch #5/5, not
> patch #4/5.  So I'll drop patch #5 and do more in-depth testing with
> the first four patches in the patch series.
  Yeah, sorry. I had fixed this locally (the assertion I have added had
false positives) but I didn't resend. I have checked and I don't have any
other changes locally. New version of the patch 5/5 is attached.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0005-jbd2-Speedup-jbd2_journal_dirty_metadata.patch --]
[-- Type: text/x-patch, Size: 2668 bytes --]

>From 00168e5ef68e927fad169d8fc54d10e066b6883b Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 18 Mar 2015 14:29:13 +0100
Subject: [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()

It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a91f639af6c3..7a3ffd0c855f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1290,8 +1290,6 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
 	triggers->t_abort(triggers, jh2bh(jh));
 }
 
-
-
 /**
  * int jbd2_journal_dirty_metadata() -  mark a buffer as containing dirty metadata
  * @handle: transaction to add buffer to.
@@ -1325,12 +1323,36 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	WARN_ON(!transaction);
 	if (is_handle_aborted(handle))
 		return -EROFS;
-	journal = transaction->t_journal;
-	jh = jbd2_journal_grab_journal_head(bh);
-	if (!jh) {
+	if (!buffer_jbd(bh)) {
 		ret = -EUCLEAN;
 		goto out;
 	}
+	/*
+	 * We don't grab jh reference here since the buffer must be part
+	 * of the running transaction.
+	 */
+	jh = bh2jh(bh);
+	J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+			jh->b_next_transaction == transaction);
+	if (jh->b_modified == 1) {
+		/*
+		 * If it's in our transaction it must be in BJ_Metadata list.
+		 * The assertion is unreliable since we may see jh in
+		 * inconsistent state unless we grab bh_state lock. But this
+		 * is crutial to catch bugs so let's do a reliable check until
+		 * the lockless handling is fully proven.
+		 */
+		if (jh->b_transaction == transaction &&
+		    jh->b_jlist != BJ_Metadata) {
+			jbd_lock_bh_state(bh);
+			J_ASSERT_JH(jh, jh->b_transaction != transaction ||
+					jh->b_jlist == BJ_Metadata);
+			jbd_lock_bh_state(bh);
+		}
+		goto out;
+	}
+
+	journal = transaction->t_journal;
 	jbd_debug(5, "journal_head %p\n", jh);
 	JBUFFER_TRACE(jh, "entry");
 
@@ -1421,7 +1443,6 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	spin_unlock(&journal->j_list_lock);
 out_unlock_bh:
 	jbd_unlock_bh_state(bh);
-	jbd2_journal_put_journal_head(jh);
 out:
 	JBUFFER_TRACE(jh, "exit");
 	return ret;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-06-08 16:47   ` Theodore Ts'o
  2015-06-08 22:32     ` Theodore Ts'o
@ 2015-06-17 16:56     ` Jan Kara
       [not found]       ` <CAA1ppbhojR0aaDr-BUWQLWQDo5+sO9Tc6b=Dxf5XrRAr2DT0oQ@mail.gmail.com>
  1 sibling, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-06-17 16:56 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 08-06-15 12:47:26, Ted Tso wrote:
> On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > jbd2_journal_get_write_access() and jbd2_journal_get_create_access() are
> > frequently called for buffers that are already part of the running
> > transaction - most frequently it is the case for bitmaps, inode table
> > blocks, and superblock. Since in such cases we have nothing to do, it is
> > unfortunate we still grab reference to journal head, lock the bh, lock
> > bh_state only to find out there's nothing to do.
> > 
> > Improving this is a bit subtle though since until we find out journal
> > head is attached to the running transaction, it can disappear from under
> > us because checkpointing / commit decided it's no longer needed. We deal
> > with this by protecting journal_head slab with RCU. We still have to be
> > careful about journal head being freed & reallocated within slab and
> > about exposing journal head in consistent state (in particular
> > b_modified and b_frozen_data must be in correct state before we allow
> > user to touch the buffer).
> > 
> > FIXME: Performance data.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Applied, so we can start getting some testing on this patch.  Did you
> ever get performance data?

Yes. Here are results for reaim fserver workload for 32 core machine with
128 GB of ram with ext4 on ramdisk:
Procs Vanilla    Patched
1      20420.688  21155.556
21     49684.704 178934.074
41     84630.364 196647.482
61    106344.284 204831.652
81    120751.370 214842.428
101   131585.450 208761.832
121   138092.078 212741.648
141   142271.578 212118.502
161   146008.364 213731.388
181   149569.494 216121.444

Numbers are operations per second so larger is better. You can see that
for 21 processes we get increase by 260% in the number operations. Also the
total maximum of operations the machine is able to achieve increases by
44% because of overall lower CPU overhead.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
       [not found]       ` <CAA1ppbhojR0aaDr-BUWQLWQDo5+sO9Tc6b=Dxf5XrRAr2DT0oQ@mail.gmail.com>
@ 2015-06-18  8:52         ` Jan Kara
  2015-06-21  1:56           ` Theodore Ts'o
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2015-06-18  8:52 UTC (permalink / raw)
  To: Jerry Lee; +Cc: Jan Kara, Theodore Ts'o, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 3678 bytes --]

On Thu 18-06-15 15:59:33, Jerry Lee wrote:
> Hi:
> 
> I found that there may be a typo in the attached patch 5/5:
> 
> +        /*
> +         * If it's in our transaction it must be in BJ_Metadata list.
> +         * The assertion is unreliable since we may see jh in
> +         * inconsistent state unless we grab bh_state lock. But this
> +         * is crutial to catch bugs so let's do a reliable check until
> +         * the lockless handling is fully proven.
> +         */
> +        if (jh->b_transaction == transaction &&
> +            jh->b_jlist != BJ_Metadata) {
> +            jbd_lock_bh_state(bh);
> +            J_ASSERT_JH(jh, jh->b_transaction != transaction ||
> +                    jh->b_jlist == BJ_Metadata);
> +            jbd_lock_bh_state(bh);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +        }
> +        goto out;
> 
> Does the highlight part should be "jbd_unlock_bh_state(bh)"?
  Yeah, thanks for catching this. I was apparently pretty lucky in this
passing all the tests... Attached is the new version of the patch.

								Honza

> On Thu, Jun 18, 2015 at 12:56 AM, Jan Kara <jack@suse.cz> wrote:
> 
> > On Mon 08-06-15 12:47:26, Ted Tso wrote:
> > > On Thu, Apr 02, 2015 at 03:58:19PM +0200, Jan Kara wrote:
> > > > jbd2_journal_get_write_access() and jbd2_journal_get_create_access()
> > are
> > > > frequently called for buffers that are already part of the running
> > > > transaction - most frequently it is the case for bitmaps, inode table
> > > > blocks, and superblock. Since in such cases we have nothing to do, it
> > is
> > > > unfortunate we still grab reference to journal head, lock the bh, lock
> > > > bh_state only to find out there's nothing to do.
> > > >
> > > > Improving this is a bit subtle though since until we find out journal
> > > > head is attached to the running transaction, it can disappear from
> > under
> > > > us because checkpointing / commit decided it's no longer needed. We
> > deal
> > > > with this by protecting journal_head slab with RCU. We still have to be
> > > > careful about journal head being freed & reallocated within slab and
> > > > about exposing journal head in consistent state (in particular
> > > > b_modified and b_frozen_data must be in correct state before we allow
> > > > user to touch the buffer).
> > > >
> > > > FIXME: Performance data.
> > > >
> > > > Signed-off-by: Jan Kara <jack@suse.cz>
> > >
> > > Applied, so we can start getting some testing on this patch.  Did you
> > > ever get performance data?
> >
> > Yes. Here are results for reaim fserver workload for 32 core machine with
> > 128 GB of ram with ext4 on ramdisk:
> > Procs Vanilla    Patched
> > 1      20420.688  21155.556
> > 21     49684.704 178934.074
> > 41     84630.364 196647.482
> > 61    106344.284 204831.652
> > 81    120751.370 214842.428
> > 101   131585.450 208761.832
> > 121   138092.078 212741.648
> > 141   142271.578 212118.502
> > 161   146008.364 213731.388
> > 181   149569.494 216121.444
> >
> > Numbers are operations per second so larger is better. You can see that
> > for 21 processes we get increase by 260% in the number operations. Also the
> > total maximum of operations the machine is able to achieve increases by
> > 44% because of overall lower CPU overhead.
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@suse.cz>
> > SUSE Labs, CR
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0005-jbd2-Speedup-jbd2_journal_dirty_metadata.patch --]
[-- Type: text/x-patch, Size: 2670 bytes --]

>From a19a4452879743eb7c8db593adbf60fc9ff0427c Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 18 Mar 2015 14:29:13 +0100
Subject: [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata()

It is often the case that we mark buffer as having dirty metadata when
the buffer is already in that state (frequent for bitmaps, inode table
blocks, superblock). Thus it is unnecessary to contend on grabbing
journal head reference and bh_state lock. Avoid that by checking whether
any modification to the buffer is needed before grabbing any locks or
references.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a91f639af6c3..5887c880422b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1290,8 +1290,6 @@ void jbd2_buffer_abort_trigger(struct journal_head *jh,
 	triggers->t_abort(triggers, jh2bh(jh));
 }
 
-
-
 /**
  * int jbd2_journal_dirty_metadata() -  mark a buffer as containing dirty metadata
  * @handle: transaction to add buffer to.
@@ -1325,12 +1323,36 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	WARN_ON(!transaction);
 	if (is_handle_aborted(handle))
 		return -EROFS;
-	journal = transaction->t_journal;
-	jh = jbd2_journal_grab_journal_head(bh);
-	if (!jh) {
+	if (!buffer_jbd(bh)) {
 		ret = -EUCLEAN;
 		goto out;
 	}
+	/*
+	 * We don't grab jh reference here since the buffer must be part
+	 * of the running transaction.
+	 */
+	jh = bh2jh(bh);
+	J_ASSERT_JH(jh, jh->b_transaction == transaction ||
+			jh->b_next_transaction == transaction);
+	if (jh->b_modified == 1) {
+		/*
+		 * If it's in our transaction it must be in BJ_Metadata list.
+		 * The assertion is unreliable since we may see jh in
+		 * inconsistent state unless we grab bh_state lock. But this
+		 * is crutial to catch bugs so let's do a reliable check until
+		 * the lockless handling is fully proven.
+		 */
+		if (jh->b_transaction == transaction &&
+		    jh->b_jlist != BJ_Metadata) {
+			jbd_lock_bh_state(bh);
+			J_ASSERT_JH(jh, jh->b_transaction != transaction ||
+					jh->b_jlist == BJ_Metadata);
+			jbd_unlock_bh_state(bh);
+		}
+		goto out;
+	}
+
+	journal = transaction->t_journal;
 	jbd_debug(5, "journal_head %p\n", jh);
 	JBUFFER_TRACE(jh, "entry");
 
@@ -1421,7 +1443,6 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 	spin_unlock(&journal->j_list_lock);
 out_unlock_bh:
 	jbd_unlock_bh_state(bh);
-	jbd2_journal_put_journal_head(jh);
 out:
 	JBUFFER_TRACE(jh, "exit");
 	return ret;
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access()
  2015-06-18  8:52         ` Jan Kara
@ 2015-06-21  1:56           ` Theodore Ts'o
  0 siblings, 0 replies; 20+ messages in thread
From: Theodore Ts'o @ 2015-06-21  1:56 UTC (permalink / raw)
  To: Jan Kara; +Cc: Jerry Lee, linux-ext4

On Thu, Jun 18, 2015 at 10:52:35AM +0200, Jan Kara wrote:
>   Yeah, thanks for catching this. I was apparently pretty lucky in this
> passing all the tests... Attached is the new version of the patch.

Thanks, I've added this to the ext4 tree.

	     	   	       	    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2015-06-21  1:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-02 13:58 [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Jan Kara
2015-04-02 13:58 ` [PATCH 1/5] jbd2: Simplify code flow in do_get_write_access() Jan Kara
2015-06-08 16:39   ` Theodore Ts'o
2015-04-02 13:58 ` [PATCH 2/5] jbd2: Simplify error path on allocation failure " Jan Kara
2015-06-08 16:42   ` Theodore Ts'o
2015-04-02 13:58 ` [PATCH 3/5] jbd2: Simplify code flow " Jan Kara
2015-06-08 16:45   ` Theodore Ts'o
2015-04-02 13:58 ` [PATCH 4/5] jbd2: Speedup jbd2_journal_get_[write|undo]_access() Jan Kara
2015-06-08 16:47   ` Theodore Ts'o
2015-06-08 22:32     ` Theodore Ts'o
2015-06-09  5:24       ` Theodore Ts'o
2015-06-17 16:39         ` Jan Kara
2015-06-17 16:56     ` Jan Kara
     [not found]       ` <CAA1ppbhojR0aaDr-BUWQLWQDo5+sO9Tc6b=Dxf5XrRAr2DT0oQ@mail.gmail.com>
2015-06-18  8:52         ` Jan Kara
2015-06-21  1:56           ` Theodore Ts'o
2015-04-02 13:58 ` [PATCH 5/5] jbd2: Speedup jbd2_journal_dirty_metadata() Jan Kara
2015-06-08 16:50   ` Theodore Ts'o
2015-04-02 14:23 ` [PATCH 0/5] jbd2: Avoid unnecessary locking when buffer is already journaled Theodore Ts'o
2015-04-12 10:09 ` Dmitry Monakhov
2015-04-16 10:46   ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.