linux-ext4.vger.kernel.org archive mirror
* [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
@ 2019-10-03 22:05 Jan Kara
  2019-10-03 22:05 ` [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
                   ` (51 more replies)
  0 siblings, 52 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Hello,

Here is v3 of this series with a couple more bugs fixed. All the failing tests
Ted highlighted now pass for me.

Changes since v2:
* Fixed bug in revoke credit estimates for extent freeing in bigalloc
  filesystems
* Fixed bug in xattr code treating positive return of
  ext4_journal_ensure_credits() as error
* Fixed preexisting bug in ext4_evict_inode() where we reserved too few credits
* Added trace point to jbd2_journal_restart()
* Fixed some kernel-doc bugs
* Rebased on top of 5.4-rc1

Changes since v1:
* Reordered some patches to reduce code churn
* Computation in jbd2_revoke_descriptors_per_block() was too early - moved it
  to later when journal superblock is loaded and so the feature checking
  actually works.
* Made sure nobody outside JBD2 uses handle->h_buffer_credits since it now
  also contains credits for revoke descriptors, which was confusing some users.
* Updated cover letter with more details about reproducer

Original cover letter:

I recently got a bug report where a JBD2 assertion failed due to a
transaction commit running out of journal space. After closer inspection of
the crash dump it seems that the problem is that there were too many
journal descriptor blocks (more than the (max_transaction_size >> 5) + 32 we
estimate in jbd2_log_space_left()) due to descriptor blocks with revoke
records. In fact the estimate of the number of descriptor blocks looks
pretty arbitrary and many more descriptor blocks can be needed for
revoke records. We need one revoke record for every metadata block freed.
So in the worst case (1k blocksize, 64-bit journal feature enabled,
checksumming enabled) we fit 125 revoke records in one descriptor block. In
common cases it's about 500 revoke records per descriptor block. Now when
we free large directories or large files with data journalling enabled, we can
have *lots* of blocks to revoke - with extent mapped files easily millions
in a single transaction, which can mean 10k descriptor blocks - clearly more
than the estimate of 128 descriptor blocks per transaction ;)
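
The arithmetic above can be reproduced with a small userspace sketch. The
header and tail sizes below are assumptions about the JBD2 on-disk format (a
16-byte revoke block header and a 4-byte checksum tail), and both function
names are hypothetical; this is not the series' own
jbd2_revoke_descriptors_per_block().

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed on-disk sizes; not taken from this series. */
#define REVOKE_HEADER_BYTES 16	/* 12-byte journal header + 4-byte r_count */
#define BLOCK_TAIL_BYTES     4	/* checksum tail, only with checksumming */

/* How many revoke records fit into one revoke descriptor block. */
static unsigned long revoke_records_per_block(unsigned int blocksize,
					      bool is_64bit, bool has_csum)
{
	unsigned int record = is_64bit ? 8 : 4;	/* bytes per block number */
	unsigned int space;

	assert(blocksize > REVOKE_HEADER_BYTES + BLOCK_TAIL_BYTES);
	space = blocksize - REVOKE_HEADER_BYTES;
	if (has_csum)
		space -= BLOCK_TAIL_BYTES;
	return space / record;
}

/* Descriptor blocks needed to hold 'revoked' records, rounded up. */
static unsigned long revoke_descriptor_blocks(unsigned long revoked,
					      unsigned long per_block)
{
	assert(per_block > 0);
	return (revoked + per_block - 1) / per_block;
}
```

Under these assumptions, a 1k block with the 64-bit and checksum features
holds 125 records, a 4k block with the same features holds 509, and revoking
1,250,000 blocks at 125 records per block needs 10,000 descriptor blocks.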

This patch series aims at fixing the problem by accounting descriptor blocks
into transaction credits and reserving an appropriate number of credits for
revoke descriptors on transaction handle start. Similar to normal transaction
credits, the filesystem has to provide an estimate of the number of blocks it
is going to revoke using the transaction handle so that credits for revoke
descriptors can be reserved.

The series has survived fstests in a couple of configurations and also a
stress test like:
  Create filesystem with 1KB blocksize and journal size 32MB
  Mount the filesystem with -o nodelalloc
  for (( i = 0; i < 4; i++ )); do
    dd if=/dev/zero of=file$i bs=1M count=2048 conv=fsync
    chattr +j file$i
  done
  for (( i = 0; i < 4; i++ )); do
    rm file$i&
  done

which reliably triggers the assertion failure in JBD2 on an unpatched kernel.

Review and comments are welcome :).

								Honza
Previous versions:
Link: http://lore.kernel.org/r/20190927111536.16455-1-jack@suse.cz
Link: http://lore.kernel.org/r/20190930103544.11479-1-jack@suse.cz


* [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:08   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 02/22] jbd2: Fixup stale comment in commit code Jan Kara
                   ` (50 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara, stable

When the amount of free space in the journal is very low, the arithmetic in
jbd2_log_space_left() could underflow, resulting in a very high number of
free blocks and thus triggering an assertion failure in the transaction commit
code complaining there's not enough space in the journal:

J_ASSERT(journal->j_free > 1);

Properly check for the low number of free blocks.
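
The wraparound can be shown in userspace with a minimal sketch. Both function
names are hypothetical stand-ins, and CONTROL_BLOCKS_SHIFT is assumed to
mirror JBD2_CONTROL_BLOCKS_SHIFT with a value of 5; the first function models
the old unsigned arithmetic, the second the signed arithmetic clamped at zero.

```c
#include <assert.h>
#include <limits.h>

#define CONTROL_BLOCKS_SHIFT 5	/* assumed value of JBD2_CONTROL_BLOCKS_SHIFT */

/* Old behaviour: unsigned subtraction wraps when too little space is free. */
static unsigned long space_left_unsigned(unsigned long j_free,
					 unsigned long committing)
{
	unsigned long free = j_free - 32;	/* wraps if j_free < 32 */

	if (committing)
		free -= committing + (committing >> CONTROL_BLOCKS_SHIFT);
	return free;
}

/* New behaviour: compute in a signed type and never report less than 0. */
static unsigned long space_left_clamped(unsigned long j_free,
					unsigned long committing)
{
	long free;

	assert(j_free <= LONG_MAX && committing <= LONG_MAX / 2);
	free = (long)j_free - 32;
	if (committing)
		free -= (long)(committing +
			       (committing >> CONTROL_BLOCKS_SHIFT));
	return free > 0 ? (unsigned long)free : 0;
}
```

With 10 free blocks and no committing transaction the unsigned variant
reports an enormous amount of free space while the clamped variant reports 0;
when there is enough space both agree.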

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/jbd2.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 603fbc4e2f70..10e6049c0ba9 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1582,7 +1582,7 @@ static inline int jbd2_space_needed(journal_t *journal)
 static inline unsigned long jbd2_log_space_left(journal_t *journal)
 {
 	/* Allow for rounding errors */
-	unsigned long free = journal->j_free - 32;
+	long free = journal->j_free - 32;
 
 	if (journal->j_committing_transaction) {
 		unsigned long committing = atomic_read(&journal->
@@ -1591,7 +1591,7 @@ static inline unsigned long jbd2_log_space_left(journal_t *journal)
 		/* Transaction + control blocks */
 		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
 	}
-	return free;
+	return max_t(long, free, 0);
 }
 
 /*
-- 
2.16.4



* [PATCH 02/22] jbd2: Fixup stale comment in commit code
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
  2019-10-03 22:05 ` [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:08   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir() Jan Kara
                   ` (49 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

jbd2_journal_next_log_block() does not look at
transaction->t_outstanding_credits. Remove the misleading comment.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 132fb92098c7..c6d39f2ad828 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -642,8 +642,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 		/*
 		 * start_this_handle() uses t_outstanding_credits to determine
-		 * the free space in the log, but this counter is changed
-		 * by jbd2_journal_next_log_block() also.
+		 * the free space in the log.
 		 */
 		atomic_dec(&commit_transaction->t_outstanding_credits);
 
-- 
2.16.4



* [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
  2019-10-03 22:05 ` [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
  2019-10-03 22:05 ` [PATCH 02/22] jbd2: Fixup stale comment in commit code Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:21   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 04/22] ext4: Fix credit estimate for final inode freeing Jan Kara
                   ` (48 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara, stable

When ext4_mkdir() fails to add an entry into the directory, it ends up
dropping the freshly created inode under the running transaction and thus
inode truncation happens under that transaction. That breaks the assumption
that ext4_evict_inode() does not get called from a transaction context
(although I'm not aware of any real issue) and is completely
unnecessary. Just stop the transaction before dropping the inode reference.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a427d2031a8d..9c872a33aea7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2781,8 +2781,9 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		clear_nlink(inode);
 		unlock_new_inode(inode);
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_journal_stop(handle);
 		iput(inode);
-		goto out_stop;
+		goto out_retry;
 	}
 	ext4_inc_count(handle, dir);
 	ext4_update_dx_flag(dir);
@@ -2796,6 +2797,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 out_stop:
 	if (handle)
 		ext4_journal_stop(handle);
+out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
 	return err;
-- 
2.16.4



* [PATCH 04/22] ext4: Fix credit estimate for final inode freeing
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (2 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir() Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:07   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
                   ` (47 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara, stable

The estimate for the number of credits needed for final freeing of an inode
in ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
orphan deletion, bitmap & group descriptor for inode freeing) and not
just 3.

Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 516faa280ced..e6b631d50c26 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -196,7 +196,12 @@ void ext4_evict_inode(struct inode *inode)
 {
 	handle_t *handle;
 	int err;
-	int extra_credits = 3;
+	/*
+	 * Credits for final inode cleanup and freeing:
+	 * sb + inode (ext4_orphan_del()), block bitmap, group descriptor
+	 * (xattr block freeind), bitmap, group descriptor (inode freeing)
+	 */
+	int extra_credits = 6;
 	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	trace_ext4_evict_inode(inode);
-- 
2.16.4



* [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (3 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 04/22] ext4: Fix credit estimate for final inode freeing Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:38   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
                   ` (46 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Similarly to directories, EA inodes only make journalled modifications to
their data. Change ext4_should_journal_data() to return true for them so
that we don't have to special-case them during truncate.
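
The resulting decision can be modelled in userspace as a rough sketch. The
struct, enum and function names are hypothetical, and the fall-through to the
mount's data mode is a simplification of ext4_inode_journal_mode(), not a
copy of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum data_mode { DATA_JOURNAL, DATA_ORDERED, DATA_WRITEBACK };

/* Hypothetical flattened view of the inode and mount state consulted. */
struct mock_inode {
	bool has_journal;		/* filesystem has a journal at all */
	bool is_regular;		/* S_ISREG(inode->i_mode) */
	bool is_ea_inode;		/* EXT4_INODE_EA_INODE flag set */
	bool journal_data_flag;		/* per-inode flag set by chattr +j */
	bool delalloc;			/* mounted with delayed allocation */
	enum data_mode mount_mode;	/* data= mount option */
};

static enum data_mode mock_inode_journal_mode(const struct mock_inode *in)
{
	assert(in != NULL);
	if (!in->has_journal)
		return DATA_WRITEBACK;
	/* EA inodes now take the same branch as non-regular files. */
	if (!in->is_regular || in->is_ea_inode ||
	    in->mount_mode == DATA_JOURNAL ||
	    (in->journal_data_flag && !in->delalloc))
		return DATA_JOURNAL;
	return in->mount_mode;
}

static bool mock_should_journal_data(const struct mock_inode *in)
{
	return mock_inode_journal_mode(in) == DATA_JOURNAL;
}
```

In this model an EA inode on a data=ordered mount is reported as
data-journalled even though it is a regular file, which is what lets the
truncate path stop special-casing it.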

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ef8fcf7d0d3b..99fe72522960 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -407,6 +407,7 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
 		return EXT4_INODE_WRITEBACK_DATA_MODE;	/* writeback */
 	/* We do not support data journalling with delayed allocation */
 	if (!S_ISREG(inode->i_mode) ||
+	    ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
 	    test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
 	    (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
 	    !test_opt(inode->i_sb, DELALLOC))) {
-- 
2.16.4



* [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (4 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21  1:39   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
                   ` (45 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Use the ext4 helper ext4_journal_extend() instead of open-coding it in
ext4_try_to_expand_extra_isize().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e6b631d50c26..042d23a81f44 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5970,8 +5970,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	 * If this is felt to be critical, then e2fsck should be run to
 	 * force a large enough s_min_extra_isize.
 	 */
-	if (ext4_handle_valid(handle) &&
-	    jbd2_journal_extend(handle,
+	if (ext4_journal_extend(handle,
 				EXT4_DATA_TRANS_BLOCKS(inode->i_sb)) != 0)
 		return -ENOSPC;
 
-- 
2.16.4



* [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (5 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 13:39   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 08/22] ext4: Provide function to handle transaction restarts Jan Kara
                   ` (44 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

The error cleanup path in ext4_alloc_branch() calls ext4_forget() on freshly
allocated indirect blocks with 'metadata' set to 1. This results in
generating revoke records for these blocks. However this is unnecessary
as the freed blocks are only allocated in the current transaction and
thus they will never be journalled. Make this cleanup path similar to
e.g. cleanup in ext4_splice_branch() and use ext4_free_blocks() to
handle block forgetting by passing EXT4_FREE_BLOCKS_FORGET and not
EXT4_FREE_BLOCKS_METADATA to ext4_free_blocks(). This also allows the
allocating transaction not to reserve any credits for revoke records.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 36699a131168..602abae08387 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -331,11 +331,14 @@ static int ext4_alloc_branch(handle_t *handle,
 	for (i = 0; i <= indirect_blks; i++) {
 		if (i == indirect_blks) {
 			new_blocks[i] = ext4_mb_new_blocks(handle, ar, &err);
-		} else
+		} else {
 			ar->goal = new_blocks[i] = ext4_new_meta_blocks(handle,
 					ar->inode, ar->goal,
 					ar->flags & EXT4_MB_DELALLOC_RESERVED,
 					NULL, &err);
+			/* Simplify error cleanup... */
+			branch[i+1].bh = NULL;
+		}
 		if (err) {
 			i--;
 			goto failed;
@@ -377,18 +380,25 @@ static int ext4_alloc_branch(handle_t *handle,
 	}
 	return 0;
 failed:
+	if (i == indirect_blks) {
+		/* Free data blocks */
+		ext4_free_blocks(handle, ar->inode, NULL, new_blocks[i],
+				 ar->len, 0);
+		i--;
+	}
 	for (; i >= 0; i--) {
 		/*
 		 * We want to ext4_forget() only freshly allocated indirect
-		 * blocks.  Buffer for new_blocks[i-1] is at branch[i].bh and
-		 * buffer at branch[0].bh is indirect block / inode already
-		 * existing before ext4_alloc_branch() was called.
+		 * blocks. Buffer for new_blocks[i] is at branch[i+1].bh
+		 * (buffer at branch[0].bh is indirect block / inode already
+		 * existing before ext4_alloc_branch() was called). Also
+		 * because blocks are freshly allocated, we don't need to
+		 * revoke them which is why we don't set
+		 * EXT4_FREE_BLOCKS_METADATA.
 		 */
-		if (i > 0 && i != indirect_blks && branch[i].bh)
-			ext4_forget(handle, 1, ar->inode, branch[i].bh,
-				    branch[i].bh->b_blocknr);
-		ext4_free_blocks(handle, ar->inode, NULL, new_blocks[i],
-				 (i == indirect_blks) ? ar->len : 1, 0);
+		ext4_free_blocks(handle, ar->inode, branch[i+1].bh,
+				 new_blocks[i], 1,
+				 branch[i+1].bh ? EXT4_FREE_BLOCKS_FORGET : 0);
 	}
 	return err;
 }
-- 
2.16.4



* [PATCH 08/22] ext4: Provide function to handle transaction restarts
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (6 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 16:20   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits Jan Kara
                   ` (43 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Provide an ext4_journal_ensure_credits_fn() function to ensure a transaction
has a given number of credits and to call a helper function to prepare for
restarting the transaction. This allows us to remove some boilerplate code
from various places, adds proper error handling for the case where
transaction extension or restart fails, and reduces the follow-up changes
needed for proper revoke record reservation tracking.
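
The control flow and return convention (< 0 on error, 0 if the handle has
enough credits or was extended, 1 if the transaction was restarted) can be
sketched in userspace as below. Everything here is a hypothetical model:
struct mock_handle is not handle_t, the extend and restart helpers only
imitate the journal, and a function pointer stands in for the expression
argument the real macro takes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a journal handle. */
struct mock_handle {
	int credits;		/* credits still available to this handle */
	int journal_room;	/* credits the transaction can still grant */
	int restarts;		/* restarts performed, for inspection */
};

/* Model of extending: 0 on success, 1 if the caller has to restart. */
static int mock_extend(struct mock_handle *h, int nblocks)
{
	if (nblocks > h->journal_room)
		return 1;
	h->journal_room -= nblocks;
	h->credits += nblocks;
	return 0;
}

/* Model of restarting: the handle starts over with exactly nblocks. */
static int mock_restart(struct mock_handle *h, int nblocks)
{
	h->credits = nblocks;
	h->restarts++;
	return 0;
}

typedef int (*cleanup_fn)(void *arg);

/* Returns < 0 on error, 0 if enough credits or extended, 1 if restarted. */
static int mock_ensure_credits_fn(struct mock_handle *h, int check_cred,
				  int extend_cred, cleanup_fn fn, void *arg)
{
	int err;

	assert(extend_cred >= check_cred);	/* assumption of this model */
	if (h->credits >= check_cred)
		return 0;
	err = mock_extend(h, extend_cred - h->credits);
	if (err <= 0)
		return err;
	if (fn) {
		/* Let the caller clean up (e.g. drop locks) before restart. */
		err = fn(arg);
		if (err < 0)
			return err;
	}
	err = mock_restart(h, extend_cred);
	return err == 0 ? 1 : err;
}
```

A caller that only cares about failure checks for a negative return, while a
caller that has to redo work after a restart (as ext4_ext_rm_leaf() does by
mapping a positive return to -EAGAIN) checks for a positive one.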

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h      |  4 ++-
 fs/ext4/ext4_jbd2.c | 11 +++++++
 fs/ext4/ext4_jbd2.h | 48 +++++++++++++++++++++++++++
 fs/ext4/extents.c   | 68 ++++++++++++++++++++++----------------
 fs/ext4/indirect.c  | 93 +++++++++++++++++++++++++++++----------------------
 fs/ext4/inode.c     | 26 ---------------
 fs/ext4/migrate.c   | 95 ++++++++++++++++++++---------------------------------
 fs/ext4/resize.c    | 46 ++++++--------------------
 fs/ext4/xattr.c     | 90 +++++++++++++++++++-------------------------------
 9 files changed, 234 insertions(+), 247 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..67a6fcc11182 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2604,7 +2604,6 @@ extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
 extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
-extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
@@ -3296,6 +3295,9 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     ext4_lblk_t lblk2,  ext4_lblk_t count,
 			     int mark_unwritten,int *err);
 extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
+extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
+				       int check_cred, int restart_cred);
+
 
 /* move_extent.c */
 extern void ext4_double_down_write_data_sem(struct inode *first,
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7c70b08d104c..2b98d893cda9 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -133,6 +133,17 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 	return handle;
 }
 
+int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
+				  int extend_cred)
+{
+	if (!ext4_handle_valid(handle))
+		return 0;
+	if (handle->h_buffer_credits >= check_cred)
+		return 0;
+	return ext4_journal_extend(handle,
+				   extend_cred - handle->h_buffer_credits);
+}
+
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
 				      const char *err_fn,
 				      struct buffer_head *bh,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 99fe72522960..1920b976eef1 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -346,6 +346,54 @@ static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 	return 0;
 }
 
+int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
+				  int extend_cred);
+
+
+/*
+ * Ensure @handle has at least @check_creds credits available. If not,
+ * transaction will be extended or restarted to contain at least @extend_cred
+ * credits. Before restarting transaction @fn is executed to allow for cleanup
+ * before the transaction is restarted.
+ *
+ * The return value is < 0 in case of error, 0 in case the handle has enough
+ * credits or transaction extension succeeded, 1 in case transaction had to be
+ * restarted.
+ */
+#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred, fn) \
+({									\
+	__label__ __ensure_end;						\
+	int err = __ext4_journal_ensure_credits((handle), (check_cred),	\
+						(extend_cred));		\
+									\
+	if (err <= 0)							\
+		goto __ensure_end;					\
+	err = (fn);							\
+	if (err < 0)							\
+		goto __ensure_end;					\
+	err = ext4_journal_restart((handle), (extend_cred));		\
+	if (err == 0)							\
+		err = 1;						\
+__ensure_end:								\
+	err;								\
+})
+
+/*
+ * Ensure given handle has at least requested amount of credits available,
+ * possibly restarting transaction if needed.
+ */
+static inline int ext4_journal_ensure_credits(handle_t *handle, int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits, credits, 0);
+}
+
+static inline int ext4_journal_ensure_credits_batch(handle_t *handle,
+						    int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits,
+					      EXT4_MAX_TRANS_DATA, 0);
+}
+
 static inline int ext4_journal_blocks_per_page(struct inode *inode)
 {
 	if (EXT4_JOURNAL(inode) != NULL)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..32f2c22c7ef2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -100,29 +100,40 @@ static int ext4_split_extent_at(handle_t *handle,
 static int ext4_find_delayed_extent(struct inode *inode,
 				    struct extent_status *newes);
 
-static int ext4_ext_truncate_extend_restart(handle_t *handle,
-					    struct inode *inode,
-					    int needed)
+static int ext4_ext_trunc_restart_fn(struct inode *inode, int *dropped)
 {
-	int err;
-
-	if (!ext4_handle_valid(handle))
-		return 0;
-	if (handle->h_buffer_credits >= needed)
-		return 0;
 	/*
-	 * If we need to extend the journal get a few extra blocks
-	 * while we're at it for efficiency's sake.
+	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
+	 * moment, get_block can be called only for blocks inside i_size since
+	 * page cache has been already dropped and writes are blocked by
+	 * i_mutex. So we can safely drop the i_data_sem here.
 	 */
-	needed += 3;
-	err = ext4_journal_extend(handle, needed - handle->h_buffer_credits);
-	if (err <= 0)
-		return err;
-	err = ext4_truncate_restart_trans(handle, inode, needed);
-	if (err == 0)
-		err = -EAGAIN;
+	BUG_ON(EXT4_JOURNAL(inode) == NULL);
+	ext4_discard_preallocations(inode);
+	up_write(&EXT4_I(inode)->i_data_sem);
+	*dropped = 1;
+	return 0;
+}
 
-	return err;
+/*
+ * Make sure 'handle' has at least 'check_cred' credits. If not, restart
+ * transaction with 'restart_cred' credits. The function drops i_data_sem
+ * when restarting transaction and gets it after transaction is restarted.
+ *
+ * The function returns 0 on success, 1 if transaction had to be restarted,
+ * and < 0 in case of fatal error.
+ */
+int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
+				int check_cred, int restart_cred)
+{
+	int ret;
+	int dropped = 0;
+
+	ret = ext4_journal_ensure_credits_fn(handle, check_cred, restart_cred,
+			ext4_ext_trunc_restart_fn(inode, &dropped));
+	if (dropped)
+		down_write(&EXT4_I(inode)->i_data_sem);
+	return ret;
 }
 
 /*
@@ -2820,9 +2831,13 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		}
 		credits += EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
 
-		err = ext4_ext_truncate_extend_restart(handle, inode, credits);
-		if (err)
+		err = ext4_datasem_ensure_credits(handle, inode, credits,
+						  credits);
+		if (err) {
+			if (err > 0)
+				err = -EAGAIN;
 			goto out;
+		}
 
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
@@ -5206,13 +5221,10 @@ ext4_access_path(handle_t *handle, struct inode *inode,
 	 * descriptor) for each block group; assume two block
 	 * groups
 	 */
-	if (handle->h_buffer_credits < 7) {
-		credits = ext4_writepage_trans_blocks(inode);
-		err = ext4_ext_truncate_extend_restart(handle, inode, credits);
-		/* EAGAIN is success */
-		if (err && err != -EAGAIN)
-			return err;
-	}
+	credits = ext4_writepage_trans_blocks(inode);
+	err = ext4_datasem_ensure_credits(handle, inode, 7, credits);
+	if (err < 0)
+		return err;
 
 	err = ext4_ext_get_access(handle, inode, path);
 	return err;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 602abae08387..63e1d5846442 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -699,27 +699,62 @@ int ext4_ind_trans_blocks(struct inode *inode, int nrblocks)
 	return DIV_ROUND_UP(nrblocks, EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
 }
 
+static int ext4_ind_trunc_restart_fn(handle_t *handle, struct inode *inode,
+				     struct buffer_head *bh, int *dropped)
+{
+	int err;
+
+	if (bh) {
+		BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_metadata(handle, inode, bh);
+		if (unlikely(err))
+			return err;
+	}
+	err = ext4_mark_inode_dirty(handle, inode);
+	if (unlikely(err))
+		return err;
+	/*
+	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
+	 * moment, get_block can be called only for blocks inside i_size since
+	 * page cache has been already dropped and writes are blocked by
+	 * i_mutex. So we can safely drop the i_data_sem here.
+	 */
+	BUG_ON(EXT4_JOURNAL(inode) == NULL);
+	ext4_discard_preallocations(inode);
+	up_write(&EXT4_I(inode)->i_data_sem);
+	*dropped = 1;
+	return 0;
+}
+
 /*
  * Truncate transactions can be complex and absolutely huge.  So we need to
  * be able to restart the transaction at a conventient checkpoint to make
  * sure we don't overflow the journal.
  *
  * Try to extend this transaction for the purposes of truncation.  If
- * extend fails, we need to propagate the failure up and restart the
- * transaction in the top-level truncate loop. --sct
- *
- * Returns 0 if we managed to create more room.  If we can't create more
- * room, and the transaction must be restarted we return 1.
+ * extend fails, we restart transaction.
  */
-static int try_to_extend_transaction(handle_t *handle, struct inode *inode)
+static int ext4_ind_truncate_ensure_credits(handle_t *handle,
+					    struct inode *inode,
+					    struct buffer_head *bh)
 {
-	if (!ext4_handle_valid(handle))
-		return 0;
-	if (ext4_handle_has_enough_credits(handle, EXT4_RESERVE_TRANS_BLOCKS+1))
-		return 0;
-	if (!ext4_journal_extend(handle, ext4_blocks_for_truncate(inode)))
-		return 0;
-	return 1;
+	int ret;
+	int dropped = 0;
+
+	ret = ext4_journal_ensure_credits_fn(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_blocks_for_truncate(inode),
+			ext4_ind_trunc_restart_fn(handle, inode, bh, &dropped));
+	if (dropped)
+		down_write(&EXT4_I(inode)->i_data_sem);
+	if (ret <= 0)
+		return ret;
+	if (bh) {
+		BUFFER_TRACE(bh, "retaking write access");
+		ret = ext4_journal_get_write_access(handle, bh);
+		if (unlikely(ret))
+			return ret;
+	}
+	return 0;
 }
 
 /*
@@ -854,27 +889,9 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
 		return 1;
 	}
 
-	if (try_to_extend_transaction(handle, inode)) {
-		if (bh) {
-			BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
-			err = ext4_handle_dirty_metadata(handle, inode, bh);
-			if (unlikely(err))
-				goto out_err;
-		}
-		err = ext4_mark_inode_dirty(handle, inode);
-		if (unlikely(err))
-			goto out_err;
-		err = ext4_truncate_restart_trans(handle, inode,
-					ext4_blocks_for_truncate(inode));
-		if (unlikely(err))
-			goto out_err;
-		if (bh) {
-			BUFFER_TRACE(bh, "retaking write access");
-			err = ext4_journal_get_write_access(handle, bh);
-			if (unlikely(err))
-				goto out_err;
-		}
-	}
+	err = ext4_ind_truncate_ensure_credits(handle, inode, bh);
+	if (err < 0)
+		goto out_err;
 
 	for (p = first; p < last; p++)
 		*p = 0;
@@ -1057,11 +1074,9 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 			 */
 			if (ext4_handle_is_aborted(handle))
 				return;
-			if (try_to_extend_transaction(handle, inode)) {
-				ext4_mark_inode_dirty(handle, inode);
-				ext4_truncate_restart_trans(handle, inode,
-					    ext4_blocks_for_truncate(inode));
-			}
+			if (ext4_ind_truncate_ensure_credits(handle, inode,
+							     NULL) < 0)
+				return;
 
 			/*
 			 * The forget flag here is critical because if
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 042d23a81f44..5a60176edc25 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -163,32 +163,6 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
 	       (inode->i_size < EXT4_N_BLOCKS * 4);
 }
 
-/*
- * Restart the transaction associated with *handle.  This does a commit,
- * so before we call here everything must be consistently dirtied against
- * this transaction.
- */
-int ext4_truncate_restart_trans(handle_t *handle, struct inode *inode,
-				 int nblocks)
-{
-	int ret;
-
-	/*
-	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
-	 * moment, get_block can be called only for blocks inside i_size since
-	 * page cache has been already dropped and writes are blocked by
-	 * i_mutex. So we can safely drop the i_data_sem here.
-	 */
-	BUG_ON(EXT4_JOURNAL(inode) == NULL);
-	jbd_debug(2, "restarting handle %p\n", handle);
-	up_write(&EXT4_I(inode)->i_data_sem);
-	ret = ext4_journal_restart(handle, nblocks);
-	down_write(&EXT4_I(inode)->i_data_sem);
-	ext4_discard_preallocations(inode);
-
-	return ret;
-}
-
 /*
  * Called at the last iput() if i_nlink is zero.
  */
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index b1e4d359f73b..65f09dc9d941 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -50,29 +50,9 @@ static int finish_range(handle_t *handle, struct inode *inode,
 	needed = ext4_ext_calc_credits_for_single_extent(inode,
 		    lb->last_block - lb->first_block + 1, path);
 
-	/*
-	 * Make sure the credit we accumalated is not really high
-	 */
-	if (needed && ext4_handle_has_enough_credits(handle,
-						EXT4_RESERVE_TRANS_BLOCKS)) {
-		up_write((&EXT4_I(inode)->i_data_sem));
-		retval = ext4_journal_restart(handle, needed);
-		down_write((&EXT4_I(inode)->i_data_sem));
-		if (retval)
-			goto err_out;
-	} else if (needed) {
-		retval = ext4_journal_extend(handle, needed);
-		if (retval) {
-			/*
-			 * IF not able to extend the journal restart the journal
-			 */
-			up_write((&EXT4_I(inode)->i_data_sem));
-			retval = ext4_journal_restart(handle, needed);
-			down_write((&EXT4_I(inode)->i_data_sem));
-			if (retval)
-				goto err_out;
-		}
-	}
+	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed);
+	if (retval < 0)
+		goto err_out;
 	retval = ext4_ext_insert_extent(handle, inode, &path, &newext, 0);
 err_out:
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -196,26 +176,6 @@ static int update_tind_extent_range(handle_t *handle, struct inode *inode,
 
 }
 
-static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode)
-{
-	int retval = 0, needed;
-
-	if (ext4_handle_has_enough_credits(handle, EXT4_RESERVE_TRANS_BLOCKS+1))
-		return 0;
-	/*
-	 * We are freeing a blocks. During this we touch
-	 * superblock, group descriptor and block bitmap.
-	 * So allocate a credit of 3. We may update
-	 * quota (user and group).
-	 */
-	needed = 3 + EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
-
-	if (ext4_journal_extend(handle, needed) != 0)
-		retval = ext4_journal_restart(handle, needed);
-
-	return retval;
-}
-
 static int free_dind_blocks(handle_t *handle,
 				struct inode *inode, __le32 i_data)
 {
@@ -223,6 +183,7 @@ static int free_dind_blocks(handle_t *handle,
 	__le32 *tmp_idata;
 	struct buffer_head *bh;
 	unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+	int err;
 
 	bh = ext4_sb_bread(inode->i_sb, le32_to_cpu(i_data), 0);
 	if (IS_ERR(bh))
@@ -231,7 +192,12 @@ static int free_dind_blocks(handle_t *handle,
 	tmp_idata = (__le32 *)bh->b_data;
 	for (i = 0; i < max_entries; i++) {
 		if (tmp_idata[i]) {
-			extend_credit_for_blkdel(handle, inode);
+			err = ext4_journal_ensure_credits(handle,
+						EXT4_RESERVE_TRANS_BLOCKS);
+			if (err < 0) {
+				put_bh(bh);
+				return err;
+			}
 			ext4_free_blocks(handle, inode, NULL,
 					 le32_to_cpu(tmp_idata[i]), 1,
 					 EXT4_FREE_BLOCKS_METADATA |
@@ -239,7 +205,9 @@ static int free_dind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (err < 0)
+		return err;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
 			 EXT4_FREE_BLOCKS_METADATA |
 			 EXT4_FREE_BLOCKS_FORGET);
@@ -270,7 +238,9 @@ static int free_tind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (retval < 0)
+		return retval;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
 			 EXT4_FREE_BLOCKS_METADATA |
 			 EXT4_FREE_BLOCKS_FORGET);
@@ -283,7 +253,10 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
 
 	/* ei->i_data[EXT4_IND_BLOCK] */
 	if (i_data[0]) {
-		extend_credit_for_blkdel(handle, inode);
+		retval = ext4_journal_ensure_credits(handle,
+						     EXT4_RESERVE_TRANS_BLOCKS);
+		if (retval < 0)
+			return retval;
 		ext4_free_blocks(handle, inode, NULL,
 				le32_to_cpu(i_data[0]), 1,
 				 EXT4_FREE_BLOCKS_METADATA |
@@ -318,12 +291,9 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
 	 * One credit accounted for writing the
 	 * i_data field of the original inode
 	 */
-	retval = ext4_journal_extend(handle, 1);
-	if (retval) {
-		retval = ext4_journal_restart(handle, 1);
-		if (retval)
-			goto err_out;
-	}
+	retval = ext4_journal_ensure_credits(handle, 1);
+	if (retval < 0)
+		goto err_out;
 
 	i_data[0] = ei->i_data[EXT4_IND_BLOCK];
 	i_data[1] = ei->i_data[EXT4_DIND_BLOCK];
@@ -391,15 +361,19 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
 		ix = EXT_FIRST_INDEX(eh);
 		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
 			retval = free_ext_idx(handle, inode, ix);
-			if (retval)
-				break;
+			if (retval) {
+				put_bh(bh);
+				return retval;
+			}
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (retval < 0)
+		return retval;
 	ext4_free_blocks(handle, inode, NULL, block, 1,
 			 EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
-	return retval;
+	return 0;
 }
 
 /*
@@ -574,9 +548,9 @@ int ext4_ext_migrate(struct inode *inode)
 	}
 
 	/* We mark the tmp_inode dirty via ext4_ext_tree_init. */
-	if (ext4_journal_extend(handle, 1) != 0)
-		ext4_journal_restart(handle, 1);
-
+	retval = ext4_journal_ensure_credits(handle, 1);
+	if (retval < 0)
+		goto out_stop;
 	/*
 	 * Mark the tmp_inode as of size zero
 	 */
@@ -594,6 +568,7 @@ int ext4_ext_migrate(struct inode *inode)
 
 	/* Reset the extent details */
 	ext4_ext_tree_init(handle, tmp_inode);
+out_stop:
 	ext4_journal_stop(handle);
 out:
 	unlock_new_inode(tmp_inode);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index c0e9aef376a7..3e4286b3901f 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -388,30 +388,6 @@ static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
 	return bh;
 }
 
-/*
- * If we have fewer than thresh credits, extend by EXT4_MAX_TRANS_DATA.
- * If that fails, restart the transaction & regain write access for the
- * buffer head which is used for block_bitmap modifications.
- */
-static int extend_or_restart_transaction(handle_t *handle, int thresh)
-{
-	int err;
-
-	if (ext4_handle_has_enough_credits(handle, thresh))
-		return 0;
-
-	err = ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA);
-	if (err < 0)
-		return err;
-	if (err) {
-		err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA);
-		if (err)
-			return err;
-	}
-
-	return 0;
-}
-
 /*
  * set_flexbg_block_bitmap() mark clusters [@first_cluster, @last_cluster] used.
  *
@@ -451,8 +427,8 @@ static int set_flexbg_block_bitmap(struct super_block *sb, handle_t *handle,
 			continue;
 		}
 
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			return err;
 
 		bh = sb_getblk(sb, flex_gd->groups[group].block_bitmap);
@@ -544,8 +520,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 			struct buffer_head *gdb;
 
 			ext4_debug("update backup group %#04llx\n", block);
-			err = extend_or_restart_transaction(handle, 1);
-			if (err)
+			err = ext4_journal_ensure_credits_batch(handle, 1);
+			if (err < 0)
 				goto out;
 
 			gdb = sb_getblk(sb, block);
@@ -602,8 +578,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize block bitmap of the @group */
 		block = group_data[i].block_bitmap;
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			goto out;
 
 		bh = bclean(handle, sb, block);
@@ -631,8 +607,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize inode bitmap of the @group */
 		block = group_data[i].inode_bitmap;
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			goto out;
 		/* Mark unused entries in inode bitmap used */
 		bh = bclean(handle, sb, block);
@@ -1109,10 +1085,8 @@ static void update_backups(struct super_block *sb, sector_t blk_off, char *data,
 		ext4_fsblk_t backup_block;
 
 		/* Out of journal space, and can't get more - abort - so sad */
-		if (ext4_handle_valid(handle) &&
-		    handle->h_buffer_credits == 0 &&
-		    ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA) &&
-		    (err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			break;
 
 		if (meta_bg == 0)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..b79d8ffd3e9b 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -967,55 +967,6 @@ int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
 	return credits;
 }
 
-static int ext4_xattr_ensure_credits(handle_t *handle, struct inode *inode,
-				     int credits, struct buffer_head *bh,
-				     bool dirty, bool block_csum)
-{
-	int error;
-
-	if (!ext4_handle_valid(handle))
-		return 0;
-
-	if (handle->h_buffer_credits >= credits)
-		return 0;
-
-	error = ext4_journal_extend(handle, credits - handle->h_buffer_credits);
-	if (!error)
-		return 0;
-	if (error < 0) {
-		ext4_warning(inode->i_sb, "Extend journal (error %d)", error);
-		return error;
-	}
-
-	if (bh && dirty) {
-		if (block_csum)
-			ext4_xattr_block_csum_set(inode, bh);
-		error = ext4_handle_dirty_metadata(handle, NULL, bh);
-		if (error) {
-			ext4_warning(inode->i_sb, "Handle metadata (error %d)",
-				     error);
-			return error;
-		}
-	}
-
-	error = ext4_journal_restart(handle, credits);
-	if (error) {
-		ext4_warning(inode->i_sb, "Restart journal (error %d)", error);
-		return error;
-	}
-
-	if (bh) {
-		error = ext4_journal_get_write_access(handle, bh);
-		if (error) {
-			ext4_warning(inode->i_sb,
-				     "Get write access failed (error %d)",
-				     error);
-			return error;
-		}
-	}
-	return 0;
-}
-
 static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 				       int ref_change)
 {
@@ -1149,6 +1100,24 @@ static int ext4_xattr_inode_inc_ref_all(handle_t *handle, struct inode *parent,
 	return saved_err;
 }
 
+static int ext4_xattr_restart_fn(handle_t *handle, struct inode *inode,
+			struct buffer_head *bh, bool block_csum, bool dirty)
+{
+	int error;
+
+	if (bh && dirty) {
+		if (block_csum)
+			ext4_xattr_block_csum_set(inode, bh);
+		error = ext4_handle_dirty_metadata(handle, NULL, bh);
+		if (error) {
+			ext4_warning(inode->i_sb, "Handle metadata (error %d)",
+				     error);
+			return error;
+		}
+	}
+	return 0;
+}
+
 static void
 ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			     struct buffer_head *bh,
@@ -1185,13 +1154,23 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			continue;
 		}
 
-		err = ext4_xattr_ensure_credits(handle, parent, credits, bh,
-						dirty, block_csum);
-		if (err) {
+		err = ext4_journal_ensure_credits_fn(handle, credits, credits,
+			ext4_xattr_restart_fn(handle, parent, bh, block_csum,
+					      dirty));
+		if (err < 0) {
 			ext4_warning_inode(ea_inode, "Ensure credits err=%d",
 					   err);
 			continue;
 		}
+		if (err > 0) {
+			err = ext4_journal_get_write_access(handle, bh);
+			if (err) {
+				ext4_warning_inode(ea_inode,
+						"Re-get write access err=%d",
+						err);
+				continue;
+			}
+		}
 
 		err = ext4_xattr_inode_dec_ref(handle, ea_inode);
 		if (err) {
@@ -2862,11 +2841,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	struct inode *ea_inode;
 	int error;
 
-	error = ext4_xattr_ensure_credits(handle, inode, extra_credits,
-					  NULL /* bh */,
-					  false /* dirty */,
-					  false /* block_csum */);
-	if (error) {
+	error = ext4_journal_ensure_credits(handle, extra_credits);
+	if (error < 0) {
 		EXT4_ERROR_INODE(inode, "ensure credits (error %d)", error);
 		goto cleanup;
 	}
-- 
2.16.4



* [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (7 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 08/22] ext4: Provide function to handle transaction restarts Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 16:21   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits Jan Kara
                   ` (42 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Provide an accessor function to get the number of credits available in a
handle and use it from ext4. Later, the computation of available credits
won't be so straightforward.
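
A minimal sketch of why an accessor helps, using a simplified stand-in for
handle_t rather than the real jbd2 structures. The names struct model_handle,
h_total_credits, h_revoke_blocks and model_handle_buffer_credits() are
hypothetical and are not part of this patch; the point is only that callers
keep working once the available-credit computation stops being a plain
field read.

```c
#include <assert.h>

/* Simplified stand-in for handle_t; not the real jbd2 layout. */
struct model_handle {
	int h_total_credits;	/* hypothetical: all credits held by the handle */
	int h_revoke_blocks;	/* hypothetical: credits set aside for revoke descriptors */
};

/*
 * Callers ask for the credits usable for buffers through this accessor
 * instead of reading a field directly, so the computation can change
 * later without touching them.
 */
static inline int model_handle_buffer_credits(const struct model_handle *handle)
{
	assert(handle->h_revoke_blocks <= handle->h_total_credits);
	return handle->h_total_credits - handle->h_revoke_blocks;
}
```

With a direct field read, every caller would need updating when such a
subtraction is introduced; with the accessor only one function changes.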

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c  | 13 +++++++------
 fs/ext4/ext4_jbd2.h  |  7 -------
 fs/ext4/xattr.c      |  2 +-
 include/linux/jbd2.h |  6 ++++++
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 2b98d893cda9..731bbfdbce5b 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -119,8 +119,8 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 		return ext4_get_nojournal();
 
 	sb = handle->h_journal->j_private;
-	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
-					  _RET_IP_);
+	trace_ext4_journal_start_reserved(sb,
+				jbd2_handle_buffer_credits(handle), _RET_IP_);
 	err = ext4_journal_check_start(sb);
 	if (err < 0) {
 		jbd2_journal_free_reserved(handle);
@@ -138,10 +138,10 @@ int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
 {
 	if (!ext4_handle_valid(handle))
 		return 0;
-	if (handle->h_buffer_credits >= check_cred)
+	if (jbd2_handle_buffer_credits(handle) >= check_cred)
 		return 0;
 	return ext4_journal_extend(handle,
-				   extend_cred - handle->h_buffer_credits);
+			   extend_cred - jbd2_handle_buffer_credits(handle));
 }
 
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
@@ -289,7 +289,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 				       handle->h_type,
 				       handle->h_line_no,
 				       handle->h_requested_credits,
-				       handle->h_buffer_credits, err);
+				       jbd2_handle_buffer_credits(handle), err);
 				return err;
 			}
 			ext4_error_inode(inode, where, line,
@@ -300,7 +300,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 					 handle->h_type,
 					 handle->h_line_no,
 					 handle->h_requested_credits,
-					 handle->h_buffer_credits, err);
+					 jbd2_handle_buffer_credits(handle),
+					 err);
 		}
 	} else {
 		if (inode)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 1920b976eef1..36aa72599646 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -288,13 +288,6 @@ static inline int ext4_handle_is_aborted(handle_t *handle)
 	return 0;
 }
 
-static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed)
-{
-	if (ext4_handle_valid(handle) && handle->h_buffer_credits < needed)
-		return 0;
-	return 1;
-}
-
 #define ext4_journal_start_sb(sb, type, nblocks)			\
 	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0)
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b79d8ffd3e9b..48a9dbd27f43 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2314,7 +2314,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 						   flags & XATTR_CREATE);
 		brelse(bh);
 
-		if (!ext4_handle_has_enough_credits(handle, credits)) {
+		if (jbd2_handle_buffer_credits(handle) < credits) {
 			error = -ENOSPC;
 			goto cleanup;
 		}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 10e6049c0ba9..727ff91d7f3e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1645,6 +1645,12 @@ static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
 	return tid;
 }
 
+
+static inline int jbd2_handle_buffer_credits(handle_t *handle)
+{
+	return handle->h_buffer_credits;
+}
+
 #ifdef __KERNEL__
 
 #define buffer_trace_init(bh)	do {} while (0)
-- 
2.16.4



* [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (8 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 16:21   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks Jan Kara
                   ` (41 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Use the jbd2 accessor function for h_buffer_credits.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/alloc.c   | 32 ++++++++++++++++----------------
 fs/ocfs2/journal.c |  4 ++--
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f9baefc76cf9..88534eb0e7c2 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -2288,9 +2288,9 @@ static int ocfs2_extend_rotate_transaction(handle_t *handle, int subtree_depth,
 	int ret = 0;
 	int credits = (path->p_tree_depth - subtree_depth) * 2 + 1 + op_credits;
 
-	if (handle->h_buffer_credits < credits)
+	if (jbd2_handle_buffer_credits(handle) < credits)
 		ret = ocfs2_extend_trans(handle,
-					 credits - handle->h_buffer_credits);
+				credits - jbd2_handle_buffer_credits(handle));
 
 	return ret;
 }
@@ -2367,7 +2367,7 @@ static int ocfs2_rotate_tree_right(handle_t *handle,
 				   struct ocfs2_path *right_path,
 				   struct ocfs2_path **ret_left_path)
 {
-	int ret, start, orig_credits = handle->h_buffer_credits;
+	int ret, start, orig_credits = jbd2_handle_buffer_credits(handle);
 	u32 cpos;
 	struct ocfs2_path *left_path = NULL;
 	struct super_block *sb = ocfs2_metadata_cache_get_super(et->et_ci);
@@ -3148,7 +3148,7 @@ static int ocfs2_rotate_tree_left(handle_t *handle,
 				  struct ocfs2_path *path,
 				  struct ocfs2_cached_dealloc_ctxt *dealloc)
 {
-	int ret, orig_credits = handle->h_buffer_credits;
+	int ret, orig_credits = jbd2_handle_buffer_credits(handle);
 	struct ocfs2_path *tmp_path = NULL, *restart_path = NULL;
 	struct ocfs2_extent_block *eb;
 	struct ocfs2_extent_list *el;
@@ -3386,8 +3386,8 @@ static int ocfs2_merge_rec_right(struct ocfs2_path *left_path,
 							right_path);
 
 		ret = ocfs2_extend_rotate_transaction(handle, subtree_index,
-						      handle->h_buffer_credits,
-						      right_path);
+					jbd2_handle_buffer_credits(handle),
+					right_path);
 		if (ret) {
 			mlog_errno(ret);
 			goto out;
@@ -3548,8 +3548,8 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 							right_path);
 
 		ret = ocfs2_extend_rotate_transaction(handle, subtree_index,
-						      handle->h_buffer_credits,
-						      left_path);
+					jbd2_handle_buffer_credits(handle),
+					left_path);
 		if (ret) {
 			mlog_errno(ret);
 			goto out;
@@ -3623,7 +3623,7 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 		    le16_to_cpu(el->l_next_free_rec) == 1) {
 			/* extend credit for ocfs2_remove_rightmost_path */
 			ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					right_path);
 			if (ret) {
 				mlog_errno(ret);
@@ -3669,7 +3669,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 	if (ctxt->c_split_covers_rec && ctxt->c_has_empty_extent) {
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3725,7 +3725,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3755,7 +3755,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3799,7 +3799,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 		if (ctxt->c_split_covers_rec) {
 			/* extend credit for ocfs2_remove_rightmost_path */
 			ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					path);
 			if (ret) {
 				mlog_errno(ret);
@@ -5358,7 +5358,7 @@ static int ocfs2_truncate_rec(handle_t *handle,
 	if (ocfs2_is_empty_extent(&el->l_recs[0]) && index > 0) {
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -5427,8 +5427,8 @@ static int ocfs2_truncate_rec(handle_t *handle,
 	}
 
 	ret = ocfs2_extend_rotate_transaction(handle, 0,
-					      handle->h_buffer_credits,
-					      path);
+					jbd2_handle_buffer_credits(handle),
+					path);
 	if (ret) {
 		mlog_errno(ret);
 		goto out;
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 930e3d388579..019aaf2a3f8a 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -419,7 +419,7 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
 	if (!nblocks)
 		return 0;
 
-	old_nblocks = handle->h_buffer_credits;
+	old_nblocks = jbd2_handle_buffer_credits(handle);
 
 	trace_ocfs2_extend_trans(old_nblocks, nblocks);
 
@@ -460,7 +460,7 @@ int ocfs2_allocate_extend_trans(handle_t *handle, int thresh)
 
 	BUG_ON(!handle);
 
-	old_nblks = handle->h_buffer_credits;
+	old_nblks = jbd2_handle_buffer_credits(handle);
 	trace_ocfs2_allocate_extend_trans(old_nblks, thresh);
 
 	if (old_nblks < thresh)
-- 
2.16.4



* [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (9 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 16:24   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop() Jan Kara
                   ` (40 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

The jbd2 statistic counting the number of blocks logged in a transaction
was wrong. It didn't count the commit block and, more importantly, it
didn't count revoke descriptor blocks. Make sure these get properly counted.
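
As a rough illustration of the accounting described above, the sketch below
sums the kinds of blocks a committed transaction writes to the journal.
model_blocks_logged() and its parameters are hypothetical and not jbd2 code;
the sketch assumes every logged block falls into one of these four groups.

```c
/*
 * Hypothetical model of per-transaction log usage. Every block written
 * to the journal counts: logged metadata blocks, the descriptor blocks
 * that tag them, revoke descriptor blocks, and the single commit block.
 */
static unsigned long model_blocks_logged(unsigned long metadata_blocks,
					 unsigned long descriptor_blocks,
					 unsigned long revoke_blocks)
{
	const unsigned long commit_blocks = 1;

	return metadata_blocks + descriptor_blocks + revoke_blocks +
	       commit_blocks;
}
```

A statistic that leaves out the last two terms under-reports log usage for
workloads that free a lot of metadata.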

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c6d39f2ad828..b67e2d0cff88 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -726,7 +726,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
 			}
 			cond_resched();
-			stats.run.rs_blocks_logged += bufs;
 
 			/* Force a new descriptor to be generated next
                            time round the loop. */
@@ -813,6 +812,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
 		jbd2_unfile_log_bh(bh);
+		stats.run.rs_blocks_logged++;
 
 		/*
 		 * The list contains temporary buffer heads created by
@@ -858,6 +858,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile");
 		clear_buffer_jwrite(bh);
 		jbd2_unfile_log_bh(bh);
+		stats.run.rs_blocks_logged++;
 		__brelse(bh);		/* One for getblk */
 		/* AKPM: bforget here */
 	}
@@ -879,6 +880,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	}
 	if (cbh)
 		err = journal_wait_on_commit_record(journal, cbh);
+	stats.run.rs_blocks_logged++;
 	if (jbd2_has_feature_async_commit(journal) &&
 	    journal->j_flags & JBD2_BARRIER) {
 		blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
-- 
2.16.4



* [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (10 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 17:29   ` Theodore Y. Ts'o
  2019-10-03 22:05 ` [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
                   ` (39 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Move code in jbd2_journal_stop() around a bit. It removes some
unnecessary code duplication and will make factoring out parts common
with jbd2__journal_restart() easier.
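
The reorganized control flow can be sketched with a simplified model. The
struct, the enum and model_journal_stop() are illustrative stand-ins, not
the jbd2 definitions; the sketch assumes only what the diff shows, namely
that the reference count is dropped once up front and that a detached
handle is simply freed.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-in for handle_t. */
struct model_handle {
	int h_ref;		/* nesting count of the handle */
	int h_aborted;		/* nonzero if the handle was aborted */
	void *h_transaction;	/* NULL once detached from a transaction */
};

enum model_stop_result {
	MODEL_STILL_REFERENCED,
	MODEL_FREED_DETACHED,
	MODEL_STOPPED,
};

static int model_journal_stop(struct model_handle *handle,
			      enum model_stop_result *what)
{
	/* Single refcount check up front, shared by both cases below. */
	if (--handle->h_ref > 0) {
		*what = MODEL_STILL_REFERENCED;
		return handle->h_aborted ? -EIO : 0;
	}
	if (!handle->h_transaction) {
		/* Detached handle: nothing to do other than free it. */
		*what = MODEL_FREED_DETACHED;
		return 0;
	}
	/* Last reference to an attached handle: do the full stop. */
	*what = MODEL_STOPPED;
	return handle->h_aborted ? -EIO : 0;
}
```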

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bee8498d7792..6f560713f7f0 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1706,41 +1706,34 @@ int jbd2_journal_stop(handle_t *handle)
 	tid_t tid;
 	pid_t pid;
 
+	if (--handle->h_ref > 0) {
+		jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
+						 handle->h_ref);
+		if (is_handle_aborted(handle))
+			return -EIO;
+		return 0;
+	}
 	if (!transaction) {
 		/*
-		 * Handle is already detached from the transaction so
-		 * there is nothing to do other than decrease a refcount,
-		 * or free the handle if refcount drops to zero
+		 * Handle is already detached from the transaction so there is
+		 * nothing to do other than free the handle.
 		 */
-		if (--handle->h_ref > 0) {
-			jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
-							 handle->h_ref);
-			return err;
-		} else {
-			if (handle->h_rsv_handle)
-				jbd2_free_handle(handle->h_rsv_handle);
-			goto free_and_exit;
-		}
+		if (handle->h_rsv_handle)
+			jbd2_free_handle(handle->h_rsv_handle);
+		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
+	tid = transaction->t_tid;
 
 	J_ASSERT(journal_current_handle() == handle);
+	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
 
 	if (is_handle_aborted(handle))
 		err = -EIO;
-	else
-		J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-
-	if (--handle->h_ref > 0) {
-		jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
-			  handle->h_ref);
-		return err;
-	}
 
 	jbd_debug(4, "Handle %p going down\n", handle);
 	trace_jbd2_handle_stats(journal->j_fs_dev->bd_dev,
-				transaction->t_tid,
-				handle->h_type, handle->h_line_no,
+				tid, handle->h_type, handle->h_line_no,
 				jiffies - handle->h_start_jiffies,
 				handle->h_sync, handle->h_requested_credits,
 				(handle->h_requested_credits -
@@ -1825,7 +1818,7 @@ int jbd2_journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
 		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
+		jbd2_log_start_commit(journal, tid);
 
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
@@ -1841,7 +1834,6 @@ int jbd2_journal_stop(handle_t *handle)
 	 * once we do this, we must not dereference transaction
 	 * pointer again.
 	 */
-	tid = transaction->t_tid;
 	if (atomic_dec_and_test(&transaction->t_updates)) {
 		wake_up(&journal->j_wait_updates);
 		if (journal->j_barrier_count)
-- 
2.16.4



* [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (11 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop() Jan Kara
@ 2019-10-03 22:05 ` Jan Kara
  2019-10-21 17:30   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 14/22] jbd2: Drop pointless wakeup " Jan Kara
                   ` (38 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:05 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

If a transaction is larger than journal->j_max_transaction_buffers, that
is a bug and not a trigger for transaction commit. Also, the very next
attempt to start a new handle will start a transaction commit anyway. So
just remove the pointless check. Arguably, we could start a transaction
commit whenever the transaction size is *close* to
journal->j_max_transaction_buffers. This has the potential to reduce the
latency of the next jbd2_journal_start() at the cost of somewhat smaller
transactions. However, this only has an effect when nobody is already
waiting in jbd2_journal_start(), which means the metadata load on the fs
is pretty light anyway, so this optimization is probably not worth it.
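
A small sketch of the commit trigger that remains at stop time once the
size check is gone. The model_* names are hypothetical; the comparison
follows the usual wraparound-safe idiom behind the kernel's time_after_eq()
and assumes the common two's-complement conversion from unsigned long to
long.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is at or after b" for a free-running tick counter.
 * Relies on the unsigned difference converting to a signed value in the
 * usual two's-complement way.
 */
static bool model_time_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/* Only a sync handle or an expired transaction requests a commit. */
static bool model_should_start_commit(bool h_sync, unsigned long now,
				      unsigned long t_expires)
{
	return h_sync || model_time_after_eq(now, t_expires);
}
```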

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6f560713f7f0..a160c3f665f9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1803,13 +1803,10 @@ int jbd2_journal_stop(handle_t *handle)
 
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
-	 * going!  We also want to force a commit if the current
-	 * transaction is occupying too much of the log, or if the
-	 * transaction is too old now.
+	 * going!  We also want to force a commit if the transaction is too
+	 * old now.
 	 */
 	if (handle->h_sync ||
-	    (atomic_read(&transaction->t_outstanding_credits) >
-	     journal->j_max_transaction_buffers) ||
 	    time_after_eq(jiffies, transaction->t_expires)) {
 		/* Do this even for aborted journals: an abort still
 		 * completes the commit thread, it just doesn't write
-- 
2.16.4



* [PATCH 14/22] jbd2: Drop pointless wakeup from jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (12 preceding siblings ...)
  2019-10-03 22:05 ` [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 17:34   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
                   ` (37 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

When we drop the last handle from a transaction and
journal->j_barrier_count > 0, jbd2_journal_stop() wakes up the
journal->j_wait_transaction_locked wait queue. This looks pointless:
waiting for outstanding handles always happens on the
journal->j_wait_updates waitqueue. journal->j_wait_transaction_locked is
used to wait for transaction state changes and by start_this_handle() for
waiting until journal->j_barrier_count drops to 0. The first case is
clearly irrelevant here since only the jbd2 thread changes transaction
state. The second case looks related, but jbd2_journal_unlock_updates() is
responsible for the wakeup in that case. So just drop the wakeup.
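
The remaining wakeup can be modelled with C11 atomics. This is a userspace
sketch, not kernel code: struct model_transaction and model_drop_update()
are hypothetical, and a counter stands in for wake_up() on
journal->j_wait_updates.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_transaction {
	atomic_int t_updates;		/* handles still attached */
	int wait_updates_wakeups;	/* stand-in for wake_up(&j_wait_updates) */
};

/* Returns true when the caller was the last updater. */
static bool model_drop_update(struct model_transaction *t)
{
	/* atomic_fetch_sub() returns the old value; 1 means we hit zero. */
	if (atomic_fetch_sub(&t->t_updates, 1) == 1) {
		t->wait_updates_wakeups++;
		return true;
	}
	return false;
}
```

Only the waiter for "no more updates" is notified here; nothing in this path
changes the barrier count, which is why a second wakeup adds nothing.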

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a160c3f665f9..d648cec3f90f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1831,11 +1831,8 @@ int jbd2_journal_stop(handle_t *handle)
 	 * once we do this, we must not dereference transaction
 	 * pointer again.
 	 */
-	if (atomic_dec_and_test(&transaction->t_updates)) {
+	if (atomic_dec_and_test(&transaction->t_updates))
 		wake_up(&journal->j_wait_updates);
-		if (journal->j_barrier_count)
-			wake_up(&journal->j_wait_transaction_locked);
-	}
 
 	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
 
-- 
2.16.4



* [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (13 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 14/22] jbd2: Drop pointless wakeup " Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 17:49   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
                   ` (36 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

jbd2__journal_restart() shares a fair amount of code with
jbd2_journal_stop(). Factor this functionality into a stop_this_handle()
helper and use it from both functions. Note that this also drops
t_handle_lock protection from jbd2__journal_restart() as
jbd2_journal_stop() does the same thing without it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 94 +++++++++++++++++++++++----------------------------
 1 file changed, 42 insertions(+), 52 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index d648cec3f90f..d4ee02e5161b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -512,12 +512,17 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_start);
 
-void jbd2_journal_free_reserved(handle_t *handle)
+static void __jbd2_journal_unreserve_handle(handle_t *handle)
 {
 	journal_t *journal = handle->h_journal;
 
 	WARN_ON(!handle->h_reserved);
 	sub_reserved_credits(journal, handle->h_buffer_credits);
+}
+
+void jbd2_journal_free_reserved(handle_t *handle)
+{
+	__jbd2_journal_unreserve_handle(handle);
 	jbd2_free_handle(handle);
 }
 EXPORT_SYMBOL(jbd2_journal_free_reserved);
@@ -655,6 +660,28 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 	return result;
 }
 
+static void stop_this_handle(handle_t *handle)
+{
+	transaction_t *transaction = handle->h_transaction;
+	journal_t *journal = transaction->t_journal;
+
+	J_ASSERT(journal_current_handle() == handle);
+	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
+	current->journal_info = NULL;
+	atomic_sub(handle->h_buffer_credits,
+		   &transaction->t_outstanding_credits);
+	if (handle->h_rsv_handle)
+		__jbd2_journal_unreserve_handle(handle->h_rsv_handle);
+	if (atomic_dec_and_test(&transaction->t_updates))
+		wake_up(&journal->j_wait_updates);
+
+	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
+	/*
+	 * Scope of the GFP_NOFS context is over here and so we can restore the
+	 * original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
+}
 
 /**
  * int jbd2_journal_restart() - restart a handle .
@@ -677,52 +704,30 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
 	tid_t		tid;
-	int		need_to_start, ret;
+	int		need_to_start;
 
 	/* If we've had an abort of any type, don't even think about
 	 * actually doing the restart! */
 	if (is_handle_aborted(handle))
 		return 0;
 	journal = transaction->t_journal;
+	tid = transaction->t_tid;
 
 	/*
 	 * First unlink the handle from its current transaction, and start the
 	 * commit on that.
 	 */
-	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-	J_ASSERT(journal_current_handle() == handle);
-
-	read_lock(&journal->j_state_lock);
-	spin_lock(&transaction->t_handle_lock);
-	atomic_sub(handle->h_buffer_credits,
-		   &transaction->t_outstanding_credits);
-	if (handle->h_rsv_handle) {
-		sub_reserved_credits(journal,
-				     handle->h_rsv_handle->h_buffer_credits);
-	}
-	if (atomic_dec_and_test(&transaction->t_updates))
-		wake_up(&journal->j_wait_updates);
-	tid = transaction->t_tid;
-	spin_unlock(&transaction->t_handle_lock);
+	jbd_debug(2, "restarting handle %p\n", handle);
+	stop_this_handle(handle);
 	handle->h_transaction = NULL;
-	current->journal_info = NULL;
 
-	jbd_debug(2, "restarting handle %p\n", handle);
+	read_lock(&journal->j_state_lock);
 	need_to_start = !tid_geq(journal->j_commit_request, tid);
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-
-	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
 	handle->h_buffer_credits = nblocks;
-	/*
-	 * Restore the original nofs context because the journal restart
-	 * is basically the same thing as journal stop and start.
-	 * start_this_handle will start a new nofs context.
-	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
-	ret = start_this_handle(journal, handle, gfp_mask);
-	return ret;
+	return start_this_handle(journal, handle, gfp_mask);
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
 
@@ -1718,16 +1723,12 @@ int jbd2_journal_stop(handle_t *handle)
 		 * Handle is already detached from the transaction so there is
 		 * nothing to do other than free the handle.
 		 */
-		if (handle->h_rsv_handle)
-			jbd2_free_handle(handle->h_rsv_handle);
+		memalloc_nofs_restore(handle->saved_alloc_context);
 		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
 	tid = transaction->t_tid;
 
-	J_ASSERT(journal_current_handle() == handle);
-	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-
 	if (is_handle_aborted(handle))
 		err = -EIO;
 
@@ -1797,9 +1798,6 @@ int jbd2_journal_stop(handle_t *handle)
 
 	if (handle->h_sync)
 		transaction->t_synchronous_commit = 1;
-	current->journal_info = NULL;
-	atomic_sub(handle->h_buffer_credits,
-		   &transaction->t_outstanding_credits);
 
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
@@ -1826,27 +1824,19 @@ int jbd2_journal_stop(handle_t *handle)
 	}
 
 	/*
-	 * Once we drop t_updates, if it goes to zero the transaction
-	 * could start committing on us and eventually disappear.  So
-	 * once we do this, we must not dereference transaction
-	 * pointer again.
+	 * Once stop_this_handle() drops t_updates, the transaction could start
+	 * committing on us and eventually disappear.  So we must not
+	 * dereference transaction pointer again after calling
+	 * stop_this_handle().
 	 */
-	if (atomic_dec_and_test(&transaction->t_updates))
-		wake_up(&journal->j_wait_updates);
-
-	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
+	stop_this_handle(handle);
 
 	if (wait_for_commit)
 		err = jbd2_log_wait_commit(journal, tid);
 
-	if (handle->h_rsv_handle)
-		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
-	/*
-	 * Scope of the GFP_NOFS context is over here and so we can restore the
-	 * original alloc context.
-	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
+	if (handle->h_rsv_handle)
+		jbd2_free_handle(handle->h_rsv_handle);
 	jbd2_free_handle(handle);
 	return err;
 }
-- 
2.16.4



* [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (14 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 21:04   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 17/22] jbd2: Drop jbd2_space_needed() Jan Kara
                   ` (35 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Currently, journal descriptor blocks are not accounted in
transaction->t_outstanding_credits and we just leave some slack
space in the journal for them (in jbd2_log_space_left() and
jbd2_space_needed()). This makes proper accounting (and the reservation
we want to add) of descriptor blocks difficult, so switch to accounting
descriptor blocks in transaction->t_outstanding_credits and reserve
the same amount of credits in t_outstanding_credits for journal
descriptor blocks when creating a transaction.
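
As a rough illustration of the accounting described above, here is a
minimal userspace sketch. It is not kernel code: struct journal_model and
both helper names are hypothetical stand-ins, and only the shift value and
the formula are taken from the diff below.

```c
/* Shift used by the patch: one descriptor block per 32 transaction buffers. */
#define JBD2_CONTROL_BLOCKS_SHIFT 5

/* Hypothetical stand-in for the two journal fields the computation reads. */
struct journal_model {
	int max_transaction_buffers;	/* models journal->j_max_transaction_buffers */
	int reserved_credits;		/* models journal->j_reserved_credits */
};

/* Same formula as jbd2_descriptor_blocks_per_trans() in the patch. */
static int descriptor_blocks_per_trans(const struct journal_model *j)
{
	return j->max_transaction_buffers >> JBD2_CONTROL_BLOCKS_SHIFT;
}

/*
 * Value t_outstanding_credits starts with for a new transaction:
 * the descriptor block reservation plus already reserved credits.
 */
static int initial_outstanding_credits(const struct journal_model *j)
{
	return descriptor_blocks_per_trans(j) + j->reserved_credits;
}
```

For a journal with 8192 transaction buffers this sets aside 256 credits up
front, and each descriptor buffer later handed out by
jbd2_journal_get_descriptor_buffer() gives one of them back.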

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c      |  3 +++
 fs/jbd2/journal.c     |  1 +
 fs/jbd2/transaction.c | 20 ++++++++++++--------
 include/linux/jbd2.h  | 16 +++-------------
 4 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b67e2d0cff88..43f2dde5bb47 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -889,6 +889,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	if (err)
 		jbd2_journal_abort(journal, err);
 
+	WARN_ON_ONCE(
+		atomic_read(&commit_transaction->t_outstanding_credits) < 0);
+
 	/*
 	 * Now disk caches for filesystem device are flushed so we are safe to
 	 * erase checkpointed transactions from the log by updating journal
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1c58859aa592..a4ec198b10c5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -840,6 +840,7 @@ jbd2_journal_get_descriptor_buffer(transaction_t *transaction, int type)
 	bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
 	if (!bh)
 		return NULL;
+	atomic_dec(&transaction->t_outstanding_credits);
 	lock_buffer(bh);
 	memset(bh->b_data, 0, journal->j_blocksize);
 	header = (journal_header_t *)bh->b_data;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index d4ee02e5161b..a364d0623884 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -62,6 +62,17 @@ void jbd2_journal_free_transaction(transaction_t *transaction)
 	kmem_cache_free(transaction_cache, transaction);
 }
 
+/*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction descriptor blocks.
+ */
+#define JBD2_CONTROL_BLOCKS_SHIFT 5
+
+static int jbd2_descriptor_blocks_per_trans(journal_t *journal)
+{
+	return journal->j_max_transaction_buffers >> JBD2_CONTROL_BLOCKS_SHIFT;
+}
+
 /*
  * jbd2_get_transaction: obtain a new transaction_t object.
  *
@@ -88,6 +99,7 @@ static void jbd2_get_transaction(journal_t *journal,
 	spin_lock_init(&transaction->t_handle_lock);
 	atomic_set(&transaction->t_updates, 0);
 	atomic_set(&transaction->t_outstanding_credits,
+		   jbd2_descriptor_blocks_per_trans(journal) +
 		   atomic_read(&journal->j_reserved_credits));
 	atomic_set(&transaction->t_handle_count, 0);
 	INIT_LIST_HEAD(&transaction->t_inode_list);
@@ -634,14 +646,6 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 		goto unlock;
 	}
 
-	if (wanted + (wanted >> JBD2_CONTROL_BLOCKS_SHIFT) >
-	    jbd2_log_space_left(journal)) {
-		jbd_debug(3, "denied handle %p %d blocks: "
-			  "insufficient log space\n", handle, nblocks);
-		atomic_sub(nblocks, &transaction->t_outstanding_credits);
-		goto unlock;
-	}
-
 	trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev,
 				 transaction->t_tid,
 				 handle->h_type, handle->h_line_no,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 727ff91d7f3e..1bb37d3e3839 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1560,20 +1560,13 @@ static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
 	return journal->j_chksum_driver != NULL;
 }
 
-/*
- * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
- * transaction control blocks.
- */
-#define JBD2_CONTROL_BLOCKS_SHIFT 5
-
 /*
  * Return the minimum number of blocks which must be free in the journal
  * before a new transaction may be started.  Must be called under j_state_lock.
  */
 static inline int jbd2_space_needed(journal_t *journal)
 {
-	int nblocks = journal->j_max_transaction_buffers;
-	return nblocks + (nblocks >> JBD2_CONTROL_BLOCKS_SHIFT);
+	return journal->j_max_transaction_buffers;
 }
 
 /*
@@ -1585,11 +1578,8 @@ static inline unsigned long jbd2_log_space_left(journal_t *journal)
 	long free = journal->j_free - 32;
 
 	if (journal->j_committing_transaction) {
-		unsigned long committing = atomic_read(&journal->
-			j_committing_transaction->t_outstanding_credits);
-
-		/* Transaction + control blocks */
-		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
+		free -= atomic_read(&journal->
+                        j_committing_transaction->t_outstanding_credits);
 	}
 	return max_t(long, free, 0);
 }
-- 
2.16.4



* [PATCH 17/22] jbd2: Drop jbd2_space_needed()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (15 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 21:05   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks Jan Kara
                   ` (34 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

The function is now just a trivial wrapper returning
journal->j_max_transaction_buffers. Drop it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  | 2 +-
 fs/jbd2/transaction.c | 5 +++--
 include/linux/jbd2.h  | 9 ---------
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index a1909066bde6..8fff6677a5da 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -110,7 +110,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
 	int nblocks, space_left;
 	/* assert_spin_locked(&journal->j_state_lock); */
 
-	nblocks = jbd2_space_needed(journal);
+	nblocks = journal->j_max_transaction_buffers;
 	while (jbd2_log_space_left(journal) < nblocks) {
 		write_unlock(&journal->j_state_lock);
 		mutex_lock_io(&journal->j_checkpoint_mutex);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a364d0623884..a4ee905db00f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -270,12 +270,13 @@ static int add_transaction_credits(journal_t *journal, int blocks,
 	 * *before* starting to dirty potentially checkpointed buffers
 	 * in the new transaction.
 	 */
-	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
+	if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers) {
 		atomic_sub(total, &t->t_outstanding_credits);
 		read_unlock(&journal->j_state_lock);
 		jbd2_might_wait_for_commit(journal);
 		write_lock(&journal->j_state_lock);
-		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
+		if (jbd2_log_space_left(journal) <
+					journal->j_max_transaction_buffers)
 			__jbd2_log_wait_for_space(journal);
 		write_unlock(&journal->j_state_lock);
 		return 1;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 1bb37d3e3839..dd8905763a3b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1560,15 +1560,6 @@ static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
 	return journal->j_chksum_driver != NULL;
 }
 
-/*
- * Return the minimum number of blocks which must be free in the journal
- * before a new transaction may be started.  Must be called under j_state_lock.
- */
-static inline int jbd2_space_needed(journal_t *journal)
-{
-	return journal->j_max_transaction_buffers;
-}
-
 /*
  * Return number of free blocks in the log. Must be called under j_state_lock.
  */
-- 
2.16.4



* [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (16 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 17/22] jbd2: Drop jbd2_space_needed() Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 21:47   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
                   ` (33 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Extend the functions for starting, extending, and restarting transaction
handles to take the number of revoke records the handle must be able to
accommodate. These functions then make sure the transaction has enough
credits to store the resulting revoke descriptor blocks. The revoke code
also tracks the number of revoke records created by a handle to catch
situations where some place didn't reserve enough space for revoke
records. Similarly to standard transaction credits, space for unused
reserved revoke records is released when the handle is stopped.

On the ext4 side we currently take a simplistic approach of reserving
space for 1024 revoke records for any transaction. This grows the amount
of credits reserved for each handle only by a few and is enough for any
normal workload so that we don't hit warnings in jbd2. We will refine
the logic in following commits.
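
To make the arithmetic concrete, here is a small userspace sketch of the
credit computations the diff below performs. The function names are
hypothetical, and the header and tail sizes (16 and 4 bytes) are assumptions
standing in for sizeof(jbd2_journal_revoke_header_t) and
sizeof(struct jbd2_journal_block_tail); only the formulas follow the patch.

```c
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed on-disk sizes, used here in place of the kernel's sizeof(). */
enum { REVOKE_HEADER_SIZE = 16, BLOCK_TAIL_SIZE = 4 };

/* Models journal_revoke_records_per_block() from the patch. */
static int revoke_records_per_block(int blocksize, int has_64bit, int has_csum)
{
	int space = blocksize - REVOKE_HEADER_SIZE;
	int record_size = has_64bit ? 8 : 4;

	if (has_csum)
		space -= BLOCK_TAIL_SIZE;
	return space / record_size;
}

/* Descriptor blocks added to a handle's credits at start or restart. */
static int revoke_descriptor_blocks(int records, int per_block)
{
	return DIV_ROUND_UP(records, per_block);
}

/* Additional blocks needed when a handle is extended by 'extra' records. */
static int extend_delta(int requested, int extra, int per_block)
{
	return DIV_ROUND_UP(requested + extra, per_block) -
	       DIV_ROUND_UP(requested, per_block);
}

/*
 * Blocks kept by the transaction when the handle stops: only the revoke
 * records that were actually used (requested minus remaining) are charged.
 */
static int used_descriptor_blocks(int requested, int remaining, int per_block)
{
	return DIV_ROUND_UP(requested - remaining, per_block);
}
```

Under these assumed sizes, a 4k block with the 64-bit and checksum features
holds 509 revoke records, so the 1024 records ext4 reserves cost 3 extra
credits per handle. A handle that issues no revokes returns all of them when
it is stopped.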

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c   |  2 +-
 fs/ext4/ext4_jbd2.h   |  4 ++--
 fs/jbd2/journal.c     | 17 +++++++++++++++++
 fs/jbd2/revoke.c      |  2 ++
 fs/jbd2/transaction.c | 39 ++++++++++++++++++++++++++++++++-------
 fs/ocfs2/journal.c    |  4 ++--
 include/linux/jbd2.h  | 27 ++++++++++++++++++++++-----
 7 files changed, 78 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 731bbfdbce5b..b81190bee32d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -78,7 +78,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	journal = EXT4_SB(sb)->s_journal;
 	if (!journal)
 		return ext4_get_nojournal();
-	return jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS,
+	return jbd2__journal_start(journal, blocks, rsv_blocks, 1024, GFP_NOFS,
 				   type, line);
 }
 
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 36aa72599646..aca05e52e317 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -328,14 +328,14 @@ static inline handle_t *ext4_journal_current_handle(void)
 static inline int ext4_journal_extend(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_extend(handle, nblocks);
+		return jbd2_journal_extend(handle, nblocks, 1024);
 	return 0;
 }
 
 static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_restart(handle, nblocks);
+		return jbd2__journal_restart(handle, nblocks, 1024, GFP_NOFS);
 	return 0;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a4ec198b10c5..388f0a8c4a37 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1489,6 +1489,21 @@ void jbd2_journal_update_sb_errno(journal_t *journal)
 }
 EXPORT_SYMBOL(jbd2_journal_update_sb_errno);
 
+static int journal_revoke_records_per_block(journal_t *journal)
+{
+	int record_size;
+	int space = journal->j_blocksize - sizeof(jbd2_journal_revoke_header_t);
+
+	if (jbd2_has_feature_64bit(journal))
+		record_size = 8;
+	else
+		record_size = 4;
+
+	if (jbd2_journal_has_csum_v2or3(journal))
+		space -= sizeof(struct jbd2_journal_block_tail);
+	return space / record_size;
+}
+
 /*
  * Read the superblock for a given journal, performing initial
  * validation of the format.
@@ -1597,6 +1612,8 @@ static int journal_get_superblock(journal_t *journal)
 						   sizeof(sb->s_uuid));
 	}
 
+	journal->j_revoke_records_per_block =
+				journal_revoke_records_per_block(journal);
 	set_buffer_verified(bh);
 
 	return 0;
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f08073d7bbf5..cba797f1d3f4 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -391,6 +391,8 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 			__brelse(bh);
 		}
 	}
+	WARN_ON_ONCE(handle->h_revoke_credits <= 0);
+	handle->h_revoke_credits--;
 
 	jbd_debug(2, "insert revoke for block %llu, bh_in=%p\n",blocknr, bh_in);
 	err = insert_revoke_hash(journal, blocknr,
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a4ee905db00f..c59eb08dba3c 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -418,6 +418,7 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 	update_t_max_wait(transaction, ts);
 	handle->h_transaction = transaction;
 	handle->h_requested_credits = blocks;
+	handle->h_revoke_credits_requested = handle->h_revoke_credits;
 	handle->h_start_jiffies = jiffies;
 	atomic_inc(&transaction->t_updates);
 	atomic_inc(&transaction->t_handle_count);
@@ -451,8 +452,8 @@ static handle_t *new_handle(int nblocks)
 }
 
 handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
-			      gfp_t gfp_mask, unsigned int type,
-			      unsigned int line_no)
+			      int revoke_records, gfp_t gfp_mask,
+			      unsigned int type, unsigned int line_no)
 {
 	handle_t *handle = journal_current_handle();
 	int err;
@@ -466,6 +467,8 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 		return handle;
 	}
 
+	nblocks += DIV_ROUND_UP(revoke_records,
+				journal->j_revoke_records_per_block);
 	handle = new_handle(nblocks);
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
@@ -481,6 +484,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 		rsv_handle->h_journal = journal;
 		handle->h_rsv_handle = rsv_handle;
 	}
+	handle->h_revoke_credits = revoke_records;
 
 	err = start_this_handle(journal, handle, gfp_mask);
 	if (err < 0) {
@@ -521,7 +525,7 @@ EXPORT_SYMBOL(jbd2__journal_start);
  */
 handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
 {
-	return jbd2__journal_start(journal, nblocks, 0, GFP_NOFS, 0, 0);
+	return jbd2__journal_start(journal, nblocks, 0, 0, GFP_NOFS, 0, 0);
 }
 EXPORT_SYMBOL(jbd2_journal_start);
 
@@ -598,6 +602,7 @@ EXPORT_SYMBOL(jbd2_journal_start_reserved);
  * int jbd2_journal_extend() - extend buffer credits.
  * @handle:  handle to 'extend'
  * @nblocks: nr blocks to try to extend by.
+ * @revoke_records: number of revoke records to try to extend by.
  *
  * Some transactions, such as large extends and truncates, can be done
  * atomically all at once or in several stages.  The operation requests
@@ -614,7 +619,7 @@ EXPORT_SYMBOL(jbd2_journal_start_reserved);
  * return code < 0 implies an error
  * return code > 0 implies normal transaction-full status.
  */
-int jbd2_journal_extend(handle_t *handle, int nblocks)
+int jbd2_journal_extend(handle_t *handle, int nblocks, int revoke_records)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -636,6 +641,12 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 		goto error_out;
 	}
 
+	nblocks += DIV_ROUND_UP(
+			handle->h_revoke_credits_requested + revoke_records,
+			journal->j_revoke_records_per_block) -
+		DIV_ROUND_UP(
+			handle->h_revoke_credits_requested,
+			journal->j_revoke_records_per_block);
 	spin_lock(&transaction->t_handle_lock);
 	wanted = atomic_add_return(nblocks,
 				   &transaction->t_outstanding_credits);
@@ -655,6 +666,8 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 
 	handle->h_buffer_credits += nblocks;
 	handle->h_requested_credits += nblocks;
+	handle->h_revoke_credits += revoke_records;
+	handle->h_revoke_credits_requested += revoke_records;
 	result = 0;
 
 	jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
@@ -669,10 +682,17 @@ static void stop_this_handle(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
+	int revoke_descriptors;
 
 	J_ASSERT(journal_current_handle() == handle);
 	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
 	current->journal_info = NULL;
+	/* Subtract necessary revoke descriptor blocks from handle credits */
+	revoke_descriptors = DIV_ROUND_UP(
+		handle->h_revoke_credits_requested - handle->h_revoke_credits,
+		journal->j_revoke_records_per_block);
+	WARN_ON_ONCE(revoke_descriptors > handle->h_buffer_credits);
+	handle->h_buffer_credits -= revoke_descriptors;
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
 	if (handle->h_rsv_handle)
@@ -692,6 +712,7 @@ static void stop_this_handle(handle_t *handle)
  * int jbd2_journal_restart() - restart a handle .
  * @handle:  handle to restart
  * @nblocks: nr credits requested
+ * @revoke_records: number of revoke record credits requested
  * @gfp_mask: memory allocation flags (for start_this_handle)
  *
  * Restart a handle for a multi-transaction filesystem
@@ -704,7 +725,8 @@ static void stop_this_handle(handle_t *handle)
  * credits. We preserve reserved handle if there's any attached to the
  * passed in handle.
  */
-int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
+int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
+			  gfp_t gfp_mask)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -731,7 +753,10 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-	handle->h_buffer_credits = nblocks;
+	handle->h_buffer_credits = nblocks +
+		DIV_ROUND_UP(revoke_records,
+			     journal->j_revoke_records_per_block);
+	handle->h_revoke_credits = revoke_records;
 	return start_this_handle(journal, handle, gfp_mask);
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
@@ -739,7 +764,7 @@ EXPORT_SYMBOL(jbd2__journal_restart);
 
 int jbd2_journal_restart(handle_t *handle, int nblocks)
 {
-	return jbd2__journal_restart(handle, nblocks, GFP_NOFS);
+	return jbd2__journal_restart(handle, nblocks, 0, GFP_NOFS);
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 019aaf2a3f8a..a032f0297dad 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -426,7 +426,7 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
 #ifdef CONFIG_OCFS2_DEBUG_FS
 	status = 1;
 #else
-	status = jbd2_journal_extend(handle, nblocks);
+	status = jbd2_journal_extend(handle, nblocks, 0);
 	if (status < 0) {
 		mlog_errno(status);
 		goto bail;
@@ -466,7 +466,7 @@ int ocfs2_allocate_extend_trans(handle_t *handle, int thresh)
 	if (old_nblks < thresh)
 		return 0;
 
-	status = jbd2_journal_extend(handle, OCFS2_MAX_TRANS_DATA);
+	status = jbd2_journal_extend(handle, OCFS2_MAX_TRANS_DATA, 0);
 	if (status < 0) {
 		mlog_errno(status);
 		goto bail;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dd8905763a3b..36100fe9eab9 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -478,6 +478,7 @@ struct jbd2_revoke_table_s;
  * @h_journal: Which journal handle belongs to - used iff h_reserved set.
  * @h_rsv_handle: Handle reserved for finishing the logical operation.
  * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
+ * @h_revoke_credits: Number of remaining revoke records available for handle
  * @h_ref: Reference count on this handle.
  * @h_err: Field for caller's use to track errors through large fs operations.
  * @h_sync: Flag for sync-on-close.
@@ -488,6 +489,7 @@ struct jbd2_revoke_table_s;
  * @h_line_no: For handle statistics.
  * @h_start_jiffies: Handle Start time.
  * @h_requested_credits: Holds @h_buffer_credits after handle is started.
+ * @h_revoke_credits_requested: Holds @h_revoke_credits after handle is started.
  * @saved_alloc_context: Saved context while transaction is open.
  **/
 
@@ -505,6 +507,8 @@ struct jbd2_journal_handle
 
 	handle_t		*h_rsv_handle;
 	int			h_buffer_credits;
+	int			h_revoke_credits;
+	int			h_revoke_credits_requested;
 	int			h_ref;
 	int			h_err;
 
@@ -1024,6 +1028,13 @@ struct journal_s
 	 */
 	int			j_max_transaction_buffers;
 
+	/**
+	 * @j_revoke_records_per_block:
+	 *
+	 * Number of revoke records that fit in one descriptor block.
+	 */
+	int			j_revoke_records_per_block;
+
 	/**
 	 * @j_commit_interval:
 	 *
@@ -1358,14 +1369,16 @@ static inline handle_t *journal_current_handle(void)
 
 extern handle_t *jbd2_journal_start(journal_t *, int nblocks);
 extern handle_t *jbd2__journal_start(journal_t *, int blocks, int rsv_blocks,
-				     gfp_t gfp_mask, unsigned int type,
-				     unsigned int line_no);
+				     int revoke_records, gfp_t gfp_mask,
+				     unsigned int type, unsigned int line_no);
 extern int	 jbd2_journal_restart(handle_t *, int nblocks);
-extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
+extern int	 jbd2__journal_restart(handle_t *, int nblocks,
+				       int revoke_records, gfp_t gfp_mask);
 extern int	 jbd2_journal_start_reserved(handle_t *handle,
 				unsigned int type, unsigned int line_no);
 extern void	 jbd2_journal_free_reserved(handle_t *handle);
-extern int	 jbd2_journal_extend (handle_t *, int nblocks);
+extern int	 jbd2_journal_extend(handle_t *handle, int nblocks,
+				     int revoke_records);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_undo_access(handle_t *, struct buffer_head *);
@@ -1629,7 +1642,11 @@ static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
 
 static inline int jbd2_handle_buffer_credits(handle_t *handle)
 {
-	return handle->h_buffer_credits;
+	journal_t *journal = handle->h_transaction->t_journal;
+
+	return handle->h_buffer_credits -
+		DIV_ROUND_UP(handle->h_revoke_credits_requested,
+			     journal->j_revoke_records_per_block);
 }
 
 #ifdef __KERNEL__
-- 
2.16.4



* [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (17 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 21:48   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 20/22] jbd2: Make credit checking more strict Jan Kara
                   ` (32 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

The credit counter now contains both buffer and revoke descriptor block
credits. Rename the counter to h_total_credits to reflect that. No
functional change.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 30 +++++++++++++++---------------
 include/linux/jbd2.h  |  9 +++++----
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index c59eb08dba3c..8851cbbe3579 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -312,12 +312,12 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 			     gfp_t gfp_mask)
 {
 	transaction_t	*transaction, *new_transaction = NULL;
-	int		blocks = handle->h_buffer_credits;
+	int		blocks = handle->h_total_credits;
 	int		rsv_blocks = 0;
 	unsigned long ts = jiffies;
 
 	if (handle->h_rsv_handle)
-		rsv_blocks = handle->h_rsv_handle->h_buffer_credits;
+		rsv_blocks = handle->h_rsv_handle->h_total_credits;
 
 	/*
 	 * Limit the number of reserved credits to 1/2 of maximum transaction
@@ -445,7 +445,7 @@ static handle_t *new_handle(int nblocks)
 	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
 	if (!handle)
 		return NULL;
-	handle->h_buffer_credits = nblocks;
+	handle->h_total_credits = nblocks;
 	handle->h_ref = 1;
 
 	return handle;
@@ -534,7 +534,7 @@ static void __jbd2_journal_unreserve_handle(handle_t *handle)
 	journal_t *journal = handle->h_journal;
 
 	WARN_ON(!handle->h_reserved);
-	sub_reserved_credits(journal, handle->h_buffer_credits);
+	sub_reserved_credits(journal, handle->h_total_credits);
 }
 
 void jbd2_journal_free_reserved(handle_t *handle)
@@ -593,7 +593,7 @@ int jbd2_journal_start_reserved(handle_t *handle, unsigned int type,
 	handle->h_line_no = line_no;
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
-				line_no, handle->h_buffer_credits);
+				line_no, handle->h_total_credits);
 	return 0;
 }
 EXPORT_SYMBOL(jbd2_journal_start_reserved);
@@ -661,10 +661,10 @@ int jbd2_journal_extend(handle_t *handle, int nblocks, int revoke_records)
 	trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev,
 				 transaction->t_tid,
 				 handle->h_type, handle->h_line_no,
-				 handle->h_buffer_credits,
+				 handle->h_total_credits,
 				 nblocks);
 
-	handle->h_buffer_credits += nblocks;
+	handle->h_total_credits += nblocks;
 	handle->h_requested_credits += nblocks;
 	handle->h_revoke_credits += revoke_records;
 	handle->h_revoke_credits_requested += revoke_records;
@@ -691,9 +691,9 @@ static void stop_this_handle(handle_t *handle)
 	revoke_descriptors = DIV_ROUND_UP(
 		handle->h_revoke_credits_requested - handle->h_revoke_credits,
 		journal->j_revoke_records_per_block);
-	WARN_ON_ONCE(revoke_descriptors > handle->h_buffer_credits);
-	handle->h_buffer_credits -= revoke_descriptors;
-	atomic_sub(handle->h_buffer_credits,
+	WARN_ON_ONCE(revoke_descriptors > handle->h_total_credits);
+	handle->h_total_credits -= revoke_descriptors;
+	atomic_sub(handle->h_total_credits,
 		   &transaction->t_outstanding_credits);
 	if (handle->h_rsv_handle)
 		__jbd2_journal_unreserve_handle(handle->h_rsv_handle);
@@ -753,7 +753,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-	handle->h_buffer_credits = nblocks +
+	handle->h_total_credits = nblocks +
 		DIV_ROUND_UP(revoke_records,
 			     journal->j_revoke_records_per_block);
 	handle->h_revoke_credits = revoke_records;
@@ -1458,12 +1458,12 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 		 * of the transaction. This needs to be done
 		 * once a transaction -bzzz
 		 */
-		if (handle->h_buffer_credits <= 0) {
+		if (handle->h_total_credits <= 0) {
 			ret = -ENOSPC;
 			goto out_unlock_bh;
 		}
 		jh->b_modified = 1;
-		handle->h_buffer_credits--;
+		handle->h_total_credits--;
 	}
 
 	/*
@@ -1707,7 +1707,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 drop:
 	if (drop_reserve) {
 		/* no need to reserve log space for this block -bzzz */
-		handle->h_buffer_credits++;
+		handle->h_total_credits++;
 	}
 	return err;
 
@@ -1768,7 +1768,7 @@ int jbd2_journal_stop(handle_t *handle)
 				jiffies - handle->h_start_jiffies,
 				handle->h_sync, handle->h_requested_credits,
 				(handle->h_requested_credits -
-				 handle->h_buffer_credits));
+				 handle->h_total_credits));
 
 	/*
 	 * Implement synchronous transaction batching.  If the handle
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 36100fe9eab9..94116b1ff274 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -477,7 +477,8 @@ struct jbd2_revoke_table_s;
  * @h_transaction: Which compound transaction is this update a part of?
  * @h_journal: Which journal handle belongs to - used iff h_reserved set.
  * @h_rsv_handle: Handle reserved for finishing the logical operation.
- * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
+ * @h_total_credits: Number of remaining buffers we are allowed to add to
+ *	journal. These are dirty buffers and revoke descriptor blocks.
  * @h_revoke_credits: Number of remaining revoke records available for handle
  * @h_ref: Reference count on this handle.
  * @h_err: Field for caller's use to track errors through large fs operations.
@@ -488,7 +489,7 @@ struct jbd2_revoke_table_s;
  * @h_type: For handle statistics.
  * @h_line_no: For handle statistics.
  * @h_start_jiffies: Handle Start time.
- * @h_requested_credits: Holds @h_buffer_credits after handle is started.
+ * @h_requested_credits: Holds @h_total_credits after handle is started.
  * @h_revoke_credits_requested: Holds @h_revoke_credits after handle is started.
  * @saved_alloc_context: Saved context while transaction is open.
  **/
@@ -506,7 +507,7 @@ struct jbd2_journal_handle
 	};
 
 	handle_t		*h_rsv_handle;
-	int			h_buffer_credits;
+	int			h_total_credits;
 	int			h_revoke_credits;
 	int			h_revoke_credits_requested;
 	int			h_ref;
@@ -1644,7 +1645,7 @@ static inline int jbd2_handle_buffer_credits(handle_t *handle)
 {
 	journal_t *journal = handle->h_transaction->t_journal;
 
-	return handle->h_buffer_credits -
+	return handle->h_total_credits -
 		DIV_ROUND_UP(handle->h_revoke_credits_requested,
 			     journal->j_revoke_records_per_block);
 }
-- 
2.16.4



* [PATCH 20/22] jbd2: Make credit checking more strict
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (18 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 22:29   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 21/22] ext4: Reserve revoke credits for freed blocks Jan Kara
                   ` (31 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Make checking of available credits in jbd2_journal_dirty_metadata() more
strict. There should always be enough credits in the handle to write all
potential revoke descriptors. Also warn when there are not enough
credits, since this is a bug in the filesystem.
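
To illustrate the difference between the old and the new check, here is a
small userspace model. It is only a sketch: struct model_handle, the helper
names and the figure of 512 revoke records per descriptor block are
assumptions made for the example, not the kernel definitions.

```c
#include <assert.h>

/* Simplified stand-in for the two handle fields the check depends on. */
struct model_handle {
	int total_credits;		/* models h_total_credits */
	int revoke_credits_requested;	/* models h_revoke_credits_requested */
};

/* Round-up division for positive values, like the kernel's DIV_ROUND_UP. */
static int div_round_up(int n, int d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/*
 * Models jbd2_handle_buffer_credits(): credits left for dirtying buffers
 * once the blocks needed for revoke descriptors are set aside.
 */
static int model_buffer_credits(const struct model_handle *h,
				int records_per_block)
{
	return h->total_credits -
		div_round_up(h->revoke_credits_requested, records_per_block);
}

/* Old check: any positive total credit count is accepted. */
static int old_check_passes(const struct model_handle *h)
{
	return h->total_credits > 0;
}

/* New check: credits must remain after reserving revoke descriptor blocks. */
static int new_check_passes(const struct model_handle *h,
			    int records_per_block)
{
	return model_buffer_credits(h, records_per_block) > 0;
}
```

With 2 total credits and 1024 requested revoke records at 512 records per
block, both credits are needed for revoke descriptors, so the old check
would pass while the new one fails.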

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 8851cbbe3579..66fad49d45df 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1458,7 +1458,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 		 * of the transaction. This needs to be done
 		 * once a transaction -bzzz
 		 */
-		if (handle->h_total_credits <= 0) {
+		if (WARN_ON_ONCE(jbd2_handle_buffer_credits(handle) <= 0)) {
 			ret = -ENOSPC;
 			goto out_unlock_bh;
 		}
-- 
2.16.4



* [PATCH 21/22] ext4: Reserve revoke credits for freed blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (19 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 20/22] jbd2: Make credit checking more strict Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 23:18   ` Theodore Y. Ts'o
  2019-10-03 22:06 ` [PATCH 22/22] jbd2: Provide trace event for handle restarts Jan Kara
                   ` (30 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

So far we have reserved only a relatively high fixed amount of revoke
credits for each transaction. We over-reserved by a large amount in most
cases, but when freeing large directories or files with data journalling,
the fixed amount is not enough. In fact the worst case estimate is
inconveniently large (maximum extent size) for freeing of one extent.

We fix this by properly estimating the number of blocks that need
to be revoked when removing blocks from the inode due to truncate or
hole punching, and otherwise reserve just a small amount of revoke
credits for each transaction to accommodate freeing of an xattr block or
similar.
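
The arithmetic behind the new estimates can be sketched in userspace as
follows. This is a simplified model rather than the code from the patch: the
model_* names are made up for the example, the cluster ratio is passed in
directly instead of being read from the superblock, and the decision whether
freed data blocks need revoke records at all is reduced to a plain flag.

```c
#include <assert.h>

/* Freeing each metadata block can free one whole cluster. */
static int model_free_metadata_revoke_credits(int cluster_ratio, int blocks)
{
	assert(cluster_ratio > 0);
	return blocks * cluster_ratio;
}

/* Small default reserved for every transaction: 8 metadata blocks. */
static int model_trans_default_revoke_credits(int cluster_ratio)
{
	return model_free_metadata_revoke_credits(cluster_ratio, 8);
}

/*
 * Data blocks of one extent are contiguous, so only the partial clusters
 * at the two extent boundaries are added on top of the block count.
 * needs_revoke is an assumed flag standing in for the mount option and
 * inode journalling mode tests done by the patch.
 */
static int model_free_data_revoke_credits(int needs_revoke, int cluster_ratio,
					  int blocks)
{
	if (!needs_revoke)
		return 0;
	return blocks + 2 * cluster_ratio;
}

/* Estimate for removing logical blocks a..b from a tree of given depth. */
static int model_rm_leaf_revoke_credits(int needs_revoke, int cluster_ratio,
					int depth, int a, int b)
{
	return model_free_metadata_revoke_credits(cluster_ratio, depth) +
	       model_free_data_revoke_credits(needs_revoke, cluster_ratio,
					      b - a + 1);
}
```

For a cluster ratio of 16 the default works out to 128 revoke records per
transaction, compared with the fixed 1024 passed to JBD2 before this patch.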

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |  3 +-
 fs/ext4/ext4_jbd2.c         | 20 ++++++-----
 fs/ext4/ext4_jbd2.h         | 84 +++++++++++++++++++++++++++++++--------------
 fs/ext4/extents.c           | 27 +++++++++++----
 fs/ext4/ialloc.c            |  2 +-
 fs/ext4/indirect.c          | 12 ++++---
 fs/ext4/inode.c             |  2 +-
 fs/ext4/migrate.c           | 24 ++++++++-----
 fs/ext4/resize.c            | 16 ++++++---
 fs/ext4/xattr.c             |  4 ++-
 include/trace/events/ext4.h | 13 ++++---
 11 files changed, 140 insertions(+), 67 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 67a6fcc11182..a606d17a80b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3296,7 +3296,8 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     int mark_unwritten,int *err);
 extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
-				       int check_cred, int restart_cred);
+				       int check_cred, int restart_cred,
+				       int revoke_cred);
 
 
 /* move_extent.c */
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index b81190bee32d..d3b8cdea5df7 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -65,12 +65,14 @@ static int ext4_journal_check_start(struct super_block *sb)
 }
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int blocks, int rsv_blocks)
+				  int type, int blocks, int rsv_blocks,
+				  int revoke_creds)
 {
 	journal_t *journal;
 	int err;
 
-	trace_ext4_journal_start(sb, blocks, rsv_blocks, _RET_IP_);
+	trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds,
+				 _RET_IP_);
 	err = ext4_journal_check_start(sb);
 	if (err < 0)
 		return ERR_PTR(err);
@@ -78,8 +80,8 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	journal = EXT4_SB(sb)->s_journal;
 	if (!journal)
 		return ext4_get_nojournal();
-	return jbd2__journal_start(journal, blocks, rsv_blocks, 1024, GFP_NOFS,
-				   type, line);
+	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
+				   GFP_NOFS, type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
@@ -134,14 +136,16 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 }
 
 int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
-				  int extend_cred)
+				  int extend_cred, int revoke_cred)
 {
 	if (!ext4_handle_valid(handle))
 		return 0;
-	if (jbd2_handle_buffer_credits(handle) >= check_cred)
+	if (jbd2_handle_buffer_credits(handle) >= check_cred &&
+	    handle->h_revoke_credits >= revoke_cred)
 		return 0;
-	return ext4_journal_extend(handle,
-			   extend_cred - jbd2_handle_buffer_credits(handle));
+	extend_cred = max(0, extend_cred - jbd2_handle_buffer_credits(handle));
+	revoke_cred = max(0, revoke_cred - handle->h_revoke_credits);
+	return ext4_journal_extend(handle, extend_cred, revoke_cred);
 }
 
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index aca05e52e317..acc47943f576 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -261,7 +261,8 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
 	__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int blocks, int rsv_blocks);
+				  int type, int blocks, int rsv_blocks,
+				  int revoke_creds);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
 #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -288,21 +289,41 @@ static inline int ext4_handle_is_aborted(handle_t *handle)
 	return 0;
 }
 
+static inline int ext4_free_metadata_revoke_credits(struct super_block *sb,
+						    int blocks)
+{
+	/* Freeing each metadata block can result in freeing one cluster */
+	return blocks * EXT4_SB(sb)->s_cluster_ratio;
+}
+
+static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
+{
+	return ext4_free_metadata_revoke_credits(sb, 8);
+}
+
 #define ext4_journal_start_sb(sb, type, nblocks)			\
-	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0)
+	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0,	\
+				ext4_trans_default_revoke_credits(sb))
 
 #define ext4_journal_start(inode, type, nblocks)			\
-	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0)
+	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0,	\
+			     ext4_trans_default_revoke_credits((inode)->i_sb))
 
-#define ext4_journal_start_with_reserve(inode, type, blocks, rsv_blocks) \
-	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks))
+#define ext4_journal_start_with_reserve(inode, type, blocks, rsv_blocks)\
+	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks),\
+			     ext4_trans_default_revoke_credits((inode)->i_sb))
+
+#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
+	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
+			     (revoke_creds))
 
 static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     unsigned int line, int type,
-					     int blocks, int rsv_blocks)
+					     int blocks, int rsv_blocks,
+					     int revoke_creds)
 {
 	return __ext4_journal_start_sb(inode->i_sb, line, type, blocks,
-				       rsv_blocks);
+				       rsv_blocks, revoke_creds);
 }
 
 #define ext4_journal_stop(handle) \
@@ -325,22 +346,23 @@ static inline handle_t *ext4_journal_current_handle(void)
 	return journal_current_handle();
 }
 
-static inline int ext4_journal_extend(handle_t *handle, int nblocks)
+static inline int ext4_journal_extend(handle_t *handle, int nblocks, int revoke)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_extend(handle, nblocks, 1024);
+		return jbd2_journal_extend(handle, nblocks, revoke);
 	return 0;
 }
 
-static inline int ext4_journal_restart(handle_t *handle, int nblocks)
+static inline int ext4_journal_restart(handle_t *handle, int nblocks,
+				       int revoke)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2__journal_restart(handle, nblocks, 1024, GFP_NOFS);
+		return jbd2__journal_restart(handle, nblocks, revoke, GFP_NOFS);
 	return 0;
 }
 
 int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
-				  int extend_cred);
+				  int extend_cred, int revoke_cred);
 
 
 /*
@@ -353,18 +375,19 @@ int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
  * credits or transaction extension succeeded, 1 in case transaction had to be
  * restarted.
  */
-#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred, fn) \
+#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred,	\
+				       revoke_cred, fn) \
 ({									\
 	__label__ __ensure_end;						\
 	int err = __ext4_journal_ensure_credits((handle), (check_cred),	\
-						(extend_cred));		\
+					(extend_cred), (revoke_cred));	\
 									\
 	if (err <= 0)							\
 		goto __ensure_end;					\
 	err = (fn);							\
 	if (err < 0)							\
 		goto __ensure_end;					\
-	err = ext4_journal_restart((handle), (extend_cred));		\
+	err = ext4_journal_restart((handle), (extend_cred), (revoke_cred)); \
 	if (err == 0)							\
 		err = 1;						\
 __ensure_end:								\
@@ -373,18 +396,16 @@ __ensure_end:								\
 
 /*
  * Ensure given handle has at least requested amount of credits available,
- * possibly restarting transaction if needed.
+ * possibly restarting transaction if needed. We also make sure the transaction
+ * has space for at least ext4_trans_default_revoke_credits(sb) revoke records
+ * as freeing one or two blocks is a very common pattern and requesting this is
+ * very cheap.
  */
-static inline int ext4_journal_ensure_credits(handle_t *handle, int credits)
+static inline int ext4_journal_ensure_credits(handle_t *handle, int credits,
+					      int revoke_creds)
 {
-	return ext4_journal_ensure_credits_fn(handle, credits, credits, 0);
-}
-
-static inline int ext4_journal_ensure_credits_batch(handle_t *handle,
-						    int credits)
-{
-	return ext4_journal_ensure_credits_fn(handle, credits,
-					      EXT4_MAX_TRANS_DATA, 0);
+	return ext4_journal_ensure_credits_fn(handle, credits, credits,
+				revoke_creds, 0);
 }
 
 static inline int ext4_journal_blocks_per_page(struct inode *inode)
@@ -479,6 +500,19 @@ static inline int ext4_should_writeback_data(struct inode *inode)
 	return ext4_inode_journal_mode(inode) & EXT4_INODE_WRITEBACK_DATA_MODE;
 }
 
+static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
+{
+	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
+		return 0;
+	if (!ext4_should_journal_data(inode))
+		return 0;
+	/*
+	 * Data blocks in one extent are contiguous, just account for partial
+	 * clusters at extent boundaries
+	 */
+	return blocks + 2*EXT4_SB(inode->i_sb)->s_cluster_ratio;
+}
+
 /*
  * This function controls whether or not we should try to go down the
  * dioread_nolock code paths, which makes it safe to avoid taking
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 32f2c22c7ef2..ed28b21b826d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -124,13 +124,14 @@ static int ext4_ext_trunc_restart_fn(struct inode *inode, int *dropped)
  * and < 0 in case of fatal error.
  */
 int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
-				int check_cred, int restart_cred)
+				int check_cred, int restart_cred,
+				int revoke_cred)
 {
 	int ret;
 	int dropped = 0;
 
 	ret = ext4_journal_ensure_credits_fn(handle, check_cred, restart_cred,
-			ext4_ext_trunc_restart_fn(inode, &dropped));
+		revoke_cred, ext4_ext_trunc_restart_fn(inode, &dropped));
 	if (dropped)
 		down_write(&EXT4_I(inode)->i_data_sem);
 	return ret;
@@ -1851,7 +1852,8 @@ static void ext4_ext_try_to_merge_up(handle_t *handle,
 	 * group descriptor to release the extent tree block.  If we
 	 * can't get the journal credits, give up.
 	 */
-	if (ext4_journal_extend(handle, 2))
+	if (ext4_journal_extend(handle, 2,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1)))
 		return;
 
 	/*
@@ -2738,7 +2740,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int err = 0, correct_index = 0;
-	int depth = ext_depth(inode), credits;
+	int depth = ext_depth(inode), credits, revoke_credits;
 	struct ext4_extent_header *eh;
 	ext4_lblk_t a, b;
 	unsigned num;
@@ -2830,9 +2832,18 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 			credits += (ext_depth(inode)) + 1;
 		}
 		credits += EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
+		/*
+		 * We may end up freeing some index blocks and data from the
+		 * punched range. Note that partial clusters are accounted for
+		 * by ext4_free_data_revoke_credits().
+		 */
+		revoke_credits =
+			ext4_free_metadata_revoke_credits(inode->i_sb,
+							  ext_depth(inode)) +
+			ext4_free_data_revoke_credits(inode, b - a + 1);
 
 		err = ext4_datasem_ensure_credits(handle, inode, credits,
-						  credits);
+						  credits, revoke_credits);
 		if (err) {
 			if (err > 0)
 				err = -EAGAIN;
@@ -2963,7 +2974,9 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug("truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
+	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
+			depth + 1,
+			ext4_free_metadata_revoke_credits(inode->i_sb, depth));
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -5222,7 +5235,7 @@ ext4_access_path(handle_t *handle, struct inode *inode,
 	 * groups
 	 */
 	credits = ext4_writepage_trans_blocks(inode);
-	err = ext4_datasem_ensure_credits(handle, inode, 7, credits);
+	err = ext4_datasem_ensure_credits(handle, inode, 7, credits, 0);
 	if (err < 0)
 		return err;
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 764ff4c56233..fa8c3c485e4b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -927,7 +927,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 							 handle_type, nblocks,
-							 0);
+							 0, 0);
 			if (IS_ERR(handle)) {
 				err = PTR_ERR(handle);
 				ext4_std_error(sb, err);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 63e1d5846442..3a4ab70fe9e0 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -736,13 +736,14 @@ static int ext4_ind_trunc_restart_fn(handle_t *handle, struct inode *inode,
  */
 static int ext4_ind_truncate_ensure_credits(handle_t *handle,
 					    struct inode *inode,
-					    struct buffer_head *bh)
+					    struct buffer_head *bh,
+					    int revoke_creds)
 {
 	int ret;
 	int dropped = 0;
 
 	ret = ext4_journal_ensure_credits_fn(handle, EXT4_RESERVE_TRANS_BLOCKS,
-			ext4_blocks_for_truncate(inode),
+			ext4_blocks_for_truncate(inode), revoke_creds,
 			ext4_ind_trunc_restart_fn(handle, inode, bh, &dropped));
 	if (dropped)
 		down_write(&EXT4_I(inode)->i_data_sem);
@@ -889,7 +890,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
 		return 1;
 	}
 
-	err = ext4_ind_truncate_ensure_credits(handle, inode, bh);
+	err = ext4_ind_truncate_ensure_credits(handle, inode, bh,
+				ext4_free_data_revoke_credits(inode, count));
 	if (err < 0)
 		goto out_err;
 
@@ -1075,7 +1077,9 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 			if (ext4_handle_is_aborted(handle))
 				return;
 			if (ext4_ind_truncate_ensure_credits(handle, inode,
-							     NULL) < 0)
+					NULL,
+					ext4_free_metadata_revoke_credits(
+							inode->i_sb, 1)) < 0)
 				return;
 
 			/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5a60176edc25..f36c1ccb2252 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5945,7 +5945,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	 * force a large enough s_min_extra_isize.
 	 */
 	if (ext4_journal_extend(handle,
-				EXT4_DATA_TRANS_BLOCKS(inode->i_sb)) != 0)
+				EXT4_DATA_TRANS_BLOCKS(inode->i_sb), 0) != 0)
 		return -ENOSPC;
 
 	if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 65f09dc9d941..89725fa42573 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -50,7 +50,7 @@ static int finish_range(handle_t *handle, struct inode *inode,
 	needed = ext4_ext_calc_credits_for_single_extent(inode,
 		    lb->last_block - lb->first_block + 1, path);
 
-	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed);
+	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed, 0);
 	if (retval < 0)
 		goto err_out;
 	retval = ext4_ext_insert_extent(handle, inode, &path, &newext, 0);
@@ -182,10 +182,11 @@ static int free_dind_blocks(handle_t *handle,
 	int i;
 	__le32 *tmp_idata;
 	struct buffer_head *bh;
+	struct super_block *sb = inode->i_sb;
 	unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
 	int err;
 
-	bh = ext4_sb_bread(inode->i_sb, le32_to_cpu(i_data), 0);
+	bh = ext4_sb_bread(sb, le32_to_cpu(i_data), 0);
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
 
@@ -193,7 +194,8 @@ static int free_dind_blocks(handle_t *handle,
 	for (i = 0; i < max_entries; i++) {
 		if (tmp_idata[i]) {
 			err = ext4_journal_ensure_credits(handle,
-						EXT4_RESERVE_TRANS_BLOCKS);
+				EXT4_RESERVE_TRANS_BLOCKS,
+				ext4_free_metadata_revoke_credits(sb, 1));
 			if (err < 0) {
 				put_bh(bh);
 				return err;
@@ -205,7 +207,8 @@ static int free_dind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+				ext4_free_metadata_revoke_credits(sb, 1));
 	if (err < 0)
 		return err;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
@@ -238,7 +241,8 @@ static int free_tind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (retval < 0)
 		return retval;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
@@ -254,7 +258,8 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
 	/* ei->i_data[EXT4_IND_BLOCK] */
 	if (i_data[0]) {
 		retval = ext4_journal_ensure_credits(handle,
-						     EXT4_RESERVE_TRANS_BLOCKS);
+			EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 		if (retval < 0)
 			return retval;
 		ext4_free_blocks(handle, inode, NULL,
@@ -291,7 +296,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
 	 * One credit accounted for writing the
 	 * i_data field of the original inode
 	 */
-	retval = ext4_journal_ensure_credits(handle, 1);
+	retval = ext4_journal_ensure_credits(handle, 1, 0);
 	if (retval < 0)
 		goto err_out;
 
@@ -368,7 +373,8 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
 		}
 	}
 	put_bh(bh);
-	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (retval < 0)
 		return retval;
 	ext4_free_blocks(handle, inode, NULL, block, 1,
@@ -548,7 +554,7 @@ int ext4_ext_migrate(struct inode *inode)
 	}
 
 	/* We mark the tmp_inode dirty via ext4_ext_tree_init. */
-	retval = ext4_journal_ensure_credits(handle, 1);
+	retval = ext4_journal_ensure_credits(handle, 1, 0);
 	if (retval < 0)
 		goto out_stop;
 	/*
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 3e4286b3901f..a8c0f2b5b6e1 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -388,6 +388,12 @@ static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
 	return bh;
 }
 
+static int ext4_resize_ensure_credits_batch(handle_t *handle, int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits,
+		EXT4_MAX_TRANS_DATA, 0, 0);
+}
+
 /*
  * set_flexbg_block_bitmap() mark clusters [@first_cluster, @last_cluster] used.
  *
@@ -427,7 +433,7 @@ static int set_flexbg_block_bitmap(struct super_block *sb, handle_t *handle,
 			continue;
 		}
 
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			return err;
 
@@ -520,7 +526,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 			struct buffer_head *gdb;
 
 			ext4_debug("update backup group %#04llx\n", block);
-			err = ext4_journal_ensure_credits_batch(handle, 1);
+			err = ext4_resize_ensure_credits_batch(handle, 1);
 			if (err < 0)
 				goto out;
 
@@ -578,7 +584,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize block bitmap of the @group */
 		block = group_data[i].block_bitmap;
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			goto out;
 
@@ -607,7 +613,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize inode bitmap of the @group */
 		block = group_data[i].inode_bitmap;
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			goto out;
 		/* Mark unused entries in inode bitmap used */
@@ -1085,7 +1091,7 @@ static void update_backups(struct super_block *sb, sector_t blk_off, char *data,
 		ext4_fsblk_t backup_block;
 
 		/* Out of journal space, and can't get more - abort - so sad */
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			break;
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 48a9dbd27f43..8966a5439a22 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1155,6 +1155,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		}
 
 		err = ext4_journal_ensure_credits_fn(handle, credits, credits,
+			ext4_free_metadata_revoke_credits(parent->i_sb, 1),
 			ext4_xattr_restart_fn(handle, parent, bh, block_csum,
 					      dirty));
 		if (err < 0) {
@@ -2841,7 +2842,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	struct inode *ea_inode;
 	int error;
 
-	error = ext4_journal_ensure_credits(handle, extra_credits);
+	error = ext4_journal_ensure_credits(handle, extra_credits,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (error < 0) {
 		EXT4_ERROR_INODE(inode, "ensure credits (error %d)", error);
 		goto cleanup;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..182c9fe9c0e9 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1746,15 +1746,16 @@ TRACE_EVENT(ext4_load_inode,
 
 TRACE_EVENT(ext4_journal_start,
 	TP_PROTO(struct super_block *sb, int blocks, int rsv_blocks,
-		 unsigned long IP),
+		 int revoke_creds, unsigned long IP),
 
-	TP_ARGS(sb, blocks, rsv_blocks, IP),
+	TP_ARGS(sb, blocks, rsv_blocks, revoke_creds, IP),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(unsigned long,	ip			)
 		__field(	  int,	blocks			)
 		__field(	  int,	rsv_blocks		)
+		__field(	  int,	revoke_creds		)
 	),
 
 	TP_fast_assign(
@@ -1762,11 +1763,13 @@ TRACE_EVENT(ext4_journal_start,
 		__entry->ip		 = IP;
 		__entry->blocks		 = blocks;
 		__entry->rsv_blocks	 = rsv_blocks;
+		__entry->revoke_creds	 = revoke_creds;
 	),
 
-	TP_printk("dev %d,%d blocks, %d rsv_blocks, %d caller %pS",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->blocks, __entry->rsv_blocks, (void *)__entry->ip)
+	TP_printk("dev %d,%d blocks %d, rsv_blocks %d, revoke_creds %d, "
+		  "caller %pS", MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->blocks, __entry->rsv_blocks, __entry->revoke_creds,
+		  (void *)__entry->ip)
 );
 
 TRACE_EVENT(ext4_journal_start_reserved,
-- 
2.16.4



* [PATCH 22/22] jbd2: Provide trace event for handle restarts
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (20 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 21/22] ext4: Reserve revoke credits for freed blocks Jan Kara
@ 2019-10-03 22:06 ` Jan Kara
  2019-10-21 23:18   ` Theodore Y. Ts'o
  2019-10-19 19:19 ` [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
                   ` (29 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-03 22:06 UTC (permalink / raw)
  To: linux-ext4; +Cc: Ted Tso, Jan Kara

Provide a trace event for handle restarts to ease debugging.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c       |  8 +++++++-
 include/trace/events/jbd2.h | 16 +++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 66fad49d45df..624c33028663 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -732,6 +732,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 	journal_t *journal;
 	tid_t		tid;
 	int		need_to_start;
+	int		ret;
 
 	/* If we've had an abort of any type, don't even think about
 	 * actually doing the restart! */
@@ -757,7 +758,12 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 		DIV_ROUND_UP(revoke_records,
 			     journal->j_revoke_records_per_block);
 	handle->h_revoke_credits = revoke_records;
-	return start_this_handle(journal, handle, gfp_mask);
+	ret = start_this_handle(journal, handle, gfp_mask);
+	trace_jbd2_handle_restart(journal->j_fs_dev->bd_dev,
+				 ret ? 0 : handle->h_transaction->t_tid,
+				 handle->h_type, handle->h_line_no,
+				 handle->h_total_credits);
+	return ret;
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
 
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 2310b259329f..d16a32867f3a 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -133,7 +133,7 @@ TRACE_EVENT(jbd2_submit_inode_data,
 		  (unsigned long) __entry->ino)
 );
 
-TRACE_EVENT(jbd2_handle_start,
+DECLARE_EVENT_CLASS(jbd2_handle_start_class,
 	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
 		 unsigned int line_no, int requested_blocks),
 
@@ -161,6 +161,20 @@ TRACE_EVENT(jbd2_handle_start,
 		  __entry->type, __entry->line_no, __entry->requested_blocks)
 );
 
+DEFINE_EVENT(jbd2_handle_start_class, jbd2_handle_start,
+	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
+		 unsigned int line_no, int requested_blocks),
+
+	TP_ARGS(dev, tid, type, line_no, requested_blocks)
+);
+
+DEFINE_EVENT(jbd2_handle_start_class, jbd2_handle_restart,
+	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
+		 unsigned int line_no, int requested_blocks),
+
+	TP_ARGS(dev, tid, type, line_no, requested_blocks)
+);
+
 TRACE_EVENT(jbd2_handle_extend,
 	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
 		 unsigned int line_no, int buffer_credits,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (21 preceding siblings ...)
  2019-10-03 22:06 ` [PATCH 22/22] jbd2: Provide trace event for handle restarts Jan Kara
@ 2019-10-19 19:19 ` Theodore Y. Ts'o
  2019-10-24 13:09   ` Jan Kara
  2019-11-04  3:32 ` Theodore Y. Ts'o
                   ` (28 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-19 19:19 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

[-- Attachment #1: Type: text/plain, Size: 870 bytes --]

Hi Jan,

I've tried applying this patch set against 5.4-rc3, and I'm finding an
easily reproducible failure using:

	kvm-xfstests -c ext3conv ext4/039

It is the BUG_ON in fs/jbd2/commit.c, around line 570:

	J_ASSERT(commit_transaction->t_nr_buffers <=
		 atomic_read(&commit_transaction->t_outstanding_credits));

The failure (with the obvious debugging printk added) is:

ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
[    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
[    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
[    8.229215] ------------[ cut here ]------------
[    8.230249] kernel BUG at fs/jbd2/commit.c:573!
     	       ...

The full log is attached (although the stack trace isn't terribly
interesting, since this is being run out of kjournald2).
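
For readers following along, the failing condition can be restated as a
small user-space sketch.  The struct and helper names below are
hypothetical stand-ins, not the kernel's types; only the comparison itself
is taken from the J_ASSERT quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two counters compared by the assertion. */
struct txn_counters {
	int nr_buffers;          /* metadata buffers attached to the transaction */
	int outstanding_credits; /* credits still accounted to the transaction */
};

/* Same comparison as the quoted J_ASSERT: buffers must fit in the credits. */
static bool commit_credits_ok(const struct txn_counters *t)
{
	assert(t != NULL);
	return t->nr_buffers <= t->outstanding_credits;
}

/* Number of credits the transaction is short by (0 when it fits). */
static int credit_shortfall(const struct txn_counters *t)
{
	assert(t != NULL);
	return commit_credits_ok(t) ? 0 : t->nr_buffers - t->outstanding_credits;
}
```

With the values from the debugging printk (226 buffers, 223 credits) the
check fails and the shortfall is 3 credits.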

						- Ted


[-- Attachment #2: log.201910191513 --]
[-- Type: text/plain, Size: 7057 bytes --]

SeaBIOS (version 1.12.0-1)
Booting from ROM..
KERNEL: kernel	5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234 SMP Sat Oct 19 15:10:53 EDT 2019 x86_64
FSTESTVER: blktests	667d741 (Wed, 4 Sep 2019 10:49:18 -0700)
FSTESTVER: e2fsprogs	v1.45.4-15-g4b4f7b35 (Wed, 9 Oct 2019 20:25:01 -0400)
FSTESTVER: fio		fio-3.15 (Fri, 12 Jul 2019 10:40:36 -0600)
FSTESTVER: fsverity	2151209 (Fri, 28 Jun 2019 14:34:41 -0700)
FSTESTVER: ima-evm-utils	0267fa1 (Mon, 3 Dec 2018 06:11:35 -0500)
FSTESTVER: nvme-cli	v1.9 (Thu, 15 Aug 2019 13:14:59 -0600)
FSTESTVER: quota		6e63107 (Thu, 15 Aug 2019 11:23:55 +0200)
FSTESTVER: util-linux	v2.33.2 (Tue, 9 Apr 2019 14:58:07 +0200)
FSTESTVER: xfsprogs	v5.3.0-rc1-8-g7aaa32db (Thu, 19 Sep 2019 13:21:52 -0400)
FSTESTVER: xfstests	linux-v3.8-2563-g45bd2a28 (Mon, 14 Oct 2019 08:11:38 -0400)
FSTESTVER: xfstests-bld	5e2a748 (Mon, 14 Oct 2019 09:36:05 -0400)
FSTESTCFG: "ext3conv"
FSTESTSET: "ext4/039"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "1966.97"
              total        used        free      shared  buff/cache   available
Mem:           1966          96        1778           8          92        1830
Swap:             0           0           0
[    5.146878] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
[    5.148619] NFSD: Using legacy client tracking operations.
BEGIN TEST ext3conv (1 test): Ext4 4k block w/nodelalloc and no flex_bg Sat Oct 19 15:13:15 EDT 2019
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -O ^flex_bg
EXT_MOUNT_OPTIONS: -o block_validity,nodelalloc
FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 kvm-xfstests 5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234 SMP Sat Oct 19 15:10:53 EDT 2019
MKFS_OPTIONS  -- -q -O ^flex_bg /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,nodelalloc /dev/vdc /vdc

ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
[    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
[    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
[    8.229215] ------------[ cut here ]------------
[    8.230249] kernel BUG at fs/jbd2/commit.c:573!
[    8.231231] invalid opcode: 0000 [#1] SMP NOPTI
[    8.232223] CPU: 1 PID: 1384 Comm: jbd2/vdc-8 Not tainted 5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234
[    8.234303] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    8.236124] RIP: 0010:jbd2_journal_commit_transaction+0x1565/0x1f35
[    8.237489] Code: f0 fe ff ff 48 c7 c2 b8 52 3a 8c be 4e 00 00 00 48 c7 c7 3f 06 3f 8c c6 05 a8 c5 5a 01 01 e8 8f 15 d4 ff e9 cc fe ff ff 0f 0b <0f> 0b 0f 0b 0f 0b e8 10 a0 d5 ff 85 c0 0f 85 15 fb ff ff 48 c7 c2
[    8.241449] RSP: 0018:ffffaa64c22cbcf0 EFLAGS: 00010283
[    8.242581] RAX: 00000000000000df RBX: ffff8ed0b8b7f028 RCX: 0000000000000000
[    8.244138] RDX: 0000000000000000 RSI: 00000000000000e2 RDI: ffff8ed0bdbd6608
[    8.245664] RBP: ffffaa64c22cbe80 R08: ffff8ed0bdbd6608 R09: 0000000000000000
[    8.247029] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    8.248023] R13: ffff8ed0b52045a0 R14: ffff8ed0b5204540 R15: ffff8ed0b8b7f000
[    8.249042] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.250191] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.251007] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.252007] Call Trace:
[    8.252356]  ? __lock_acquire+0x24a/0xf80
[    8.252946]  ? __lock_acquired+0x1eb/0x310
[    8.253574]  ? kjournald2+0xe3/0x3a0
[    8.254108]  kjournald2+0xe3/0x3a0
[    8.254567]  ? replenish_dl_entity.cold+0x1d/0x1d
[    8.255364]  ? __jbd2_debug+0x50/0x50
[    8.255886]  kthread+0x126/0x140
[    8.256408]  ? kthread_delayed_work_timer_fn+0xa0/0xa0
[    8.257324]  ret_from_fork+0x3a/0x50
[    8.257916] ---[ end trace 9acad1489f655cc4 ]---
[    8.258799] RIP: 0010:jbd2_journal_commit_transaction+0x1565/0x1f35
[    8.259742] Code: f0 fe ff ff 48 c7 c2 b8 52 3a 8c be 4e 00 00 00 48 c7 c7 3f 06 3f 8c c6 05 a8 c5 5a 01 01 e8 8f 15 d4 ff e9 cc fe ff ff 0f 0b <0f> 0b 0f 0b 0f 0b e8 10 a0 d5 ff 85 c0 0f 85 15 fb ff ff 48 c7 c2
[    8.264348] RSP: 0018:ffffaa64c22cbcf0 EFLAGS: 00010283
[    8.265803] RAX: 00000000000000df RBX: ffff8ed0b8b7f028 RCX: 0000000000000000
[    8.267610] RDX: 0000000000000000 RSI: 00000000000000e2 RDI: ffff8ed0bdbd6608
[    8.269498] RBP: ffffaa64c22cbe80 R08: ffff8ed0bdbd6608 R09: 0000000000000000
[    8.271295] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    8.273021] R13: ffff8ed0b52045a0 R14: ffff8ed0b5204540 R15: ffff8ed0b8b7f000
[    8.274792] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.276789] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.278152] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.279796] ------------[ cut here ]------------
[    8.280909] WARNING: CPU: 1 PID: 1384 at kernel/exit.c:723 do_exit+0x47/0xb70
[    8.282588] CPU: 1 PID: 1384 Comm: jbd2/vdc-8 Tainted: G      D           5.4.0-rc3-xfstests-00022-g6db1a59f1c6a-dirty #1234
[    8.284979] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[    8.286909] RIP: 0010:do_exit+0x47/0xb70
[    8.287809] Code: 00 00 48 89 44 24 38 31 c0 65 48 8b 1c 25 c0 5d 01 00 48 8b 83 20 12 00 00 48 85 c0 74 0e 48 8b 10 48 39 d0 0f 84 71 04 00 00 <0f> 0b 65 44 8b 25 07 fa 15 75 41 81 e4 00 ff 1f 00 44 89 64 24 0c
[    8.290621] RSP: 0018:ffffaa64c22cbee0 EFLAGS: 00010216
[    8.291398] RAX: ffffaa64c22cbdc0 RBX: ffff8ed0b5aa6400 RCX: 0000000000000000
[    8.292582] RDX: ffff8ed0ba5e4548 RSI: 0000000000000000 RDI: 000000000000000b
[    8.293818] RBP: 000000000000000b R08: 0000000000000000 R09: 0000000000000000
[    8.295187] R10: 0000000000000008 R11: ffffaa64c22cba1d R12: 000000000000000b
[    8.296506] R13: 0000000000000002 R14: 0000000000000006 R15: ffff8ed0b5aa6400
[    8.297843] FS:  0000000000000000(0000) GS:ffff8ed0bda00000(0000) knlGS:0000000000000000
[    8.299363] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.300503] CR2: 00005601514547c0 CR3: 00000000722c4006 CR4: 0000000000360ee0
[    8.301841] Call Trace:
[    8.302275]  ? __jbd2_debug+0x50/0x50
[    8.303003]  ? kthread+0x126/0x140
[    8.303754]  rewind_stack_do_exit+0x17/0x20
[    8.304485] irq event stamp: 134503
[    8.305232] hardirqs last  enabled at (134503): [<ffffffff8ae017ea>] trace_hardirqs_on_thunk+0x1a/0x20
[    8.306951] hardirqs last disabled at (134501): [<ffffffff8bc002b0>] __do_softirq+0x2b0/0x42a
[    8.308579] softirqs last  enabled at (134502): [<ffffffff8bc0032a>] __do_softirq+0x32a/0x42a
[    8.310100] softirqs last disabled at (134495): [<ffffffff8aeb84d3>] irq_exit+0xb3/0xc0
[    8.311524] ---[ end trace 9acad1489f655cc5 ]---
QEMU: Terminated

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 04/22] ext4: Fix credit estimate for final inode freeing
  2019-10-03 22:05 ` [PATCH 04/22] ext4: Fix credit estimate for final inode freeing Jan Kara
@ 2019-10-21  1:07   ` Theodore Y. Ts'o
  2019-10-24 10:30     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Fri, Oct 04, 2019 at 12:05:50AM +0200, Jan Kara wrote:
> Estimate for the number of credits needed for final freeing of inode in
> ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
> orphan deletion, bitmap & group descriptor for inode freeing) and not
> just 3.

The modification for the inode should already be included in the
calculation for ext4_blocks_for_truncate(), no?  So we only need 3
extra blocks (sb, inode bitmap, and bg descriptor for the inode).
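
The difference between the two estimates can be written down as a short
sketch.  The helper and its parameters are hypothetical; the only facts
taken from this thread are the list of blocks touched and the question of
whether the truncate estimate already covers the inode block.

```c
#include <assert.h>

/* Blocks touched by the final inode free, besides the inode block itself. */
enum {
	EVICT_SB = 1,           /* superblock, for orphan list deletion */
	EVICT_INODE_BITMAP = 1, /* inode bitmap, for freeing the inode */
	EVICT_GROUP_DESC = 1    /* block group descriptor */
};

/*
 * Hypothetical helper: credits for evicting an inode given the truncate
 * estimate.  If that estimate already includes the inode block, only the
 * three blocks above are added; otherwise the inode block is added too.
 */
static int evict_credits(int truncate_credits, int inode_already_counted)
{
	int extra = EVICT_SB + EVICT_INODE_BITMAP + EVICT_GROUP_DESC;

	assert(truncate_credits >= 0);
	if (!inode_already_counted)
		extra += 1;
	return truncate_credits + extra;
}
```

Under the assumption argued in this reply, the extra is 3; under the
patch description's counting it is 4.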

      	     	  		    - Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left()
  2019-10-03 22:05 ` [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
@ 2019-10-21  1:08   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Fri, Oct 04, 2019 at 12:05:47AM +0200, Jan Kara wrote:
> When the amount of free space in the journal is very low, the arithmetic in
> jbd2_log_space_left() could underflow resulting in very high number of
> free blocks and thus triggering assertion failure in transaction commit
> code complaining there's not enough space in the journal:
> 
> J_ASSERT(journal->j_free > 1);
> 
> Properly check for the low number of free blocks.
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
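
The class of bug described in the quoted commit message can be shown in
user space.  This is a sketch with made-up function names and a generic
"reserve" parameter, not the code from the patch: it only shows how an
unsigned subtraction wraps to a huge value and how clamping avoids it.

```c
#include <assert.h>

/* Unchecked version: wraps around when reserve exceeds j_free. */
static unsigned long sketch_space_left_unchecked(unsigned long j_free,
						 unsigned long reserve)
{
	return j_free - reserve;
}

/* Checked version: reports zero free blocks instead of wrapping. */
static unsigned long sketch_space_left_clamped(unsigned long j_free,
					       unsigned long reserve)
{
	return j_free > reserve ? j_free - reserve : 0;
}
```

With 10 free blocks and a reserve of 32, the unchecked variant reports an
enormous amount of free space, while the clamped variant reports 0.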

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 02/22] jbd2: Fixup stale comment in commit code
  2019-10-03 22:05 ` [PATCH 02/22] jbd2: Fixup stale comment in commit code Jan Kara
@ 2019-10-21  1:08   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:48AM +0200, Jan Kara wrote:
> jbd2_journal_next_log_block() does not look at
> transaction->t_outstanding_credits. Remove the misleading comment.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-03 22:05 ` [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir() Jan Kara
@ 2019-10-21  1:21   ` Theodore Y. Ts'o
  2019-10-24 10:19     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Fri, Oct 04, 2019 at 12:05:49AM +0200, Jan Kara wrote:
> When ext4_mkdir() fails to add entry into directory, it ends up dropping
> freshly created inode under the running transaction and thus inode
> truncation happens under that transaction. That breaks assumptions that
> ext4_evict_inode() does not get called from a transaction context
> (although I'm not aware of any real issue) and is completely
> unnecessary. Just stop the transaction before dropping inode reference.
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

If we call ext4_journal_stop(handle) before calling iput(inode),
there's a chance that we could crash with the inode's i_link_counts
== 0, but before we have called ext4_evict_inode() to mark the inode
as free in the inode bitmap.  This would result in an inode leak.

Also, this isn't the only place where we can enter ext4_evict_inode()
with an active handle; the same situation arises in ext4_add_nondir(),
and for the same reason.

So I think the code is right as is.  Do you agree?

	     	       	       	     	- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes
  2019-10-03 22:05 ` [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
@ 2019-10-21  1:38   ` Theodore Y. Ts'o
  2019-10-23 16:55     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:51AM +0200, Jan Kara wrote:
> Similarly to directories, EA inodes do only journalled modifications to
> their data. Change ext4_should_journal_data() to return true for them so
> that we don't have to special-case them during truncate.

We are already special-casing EA inodes in ext4_clear_blocks() in
fs/ext4/indirect.c, and get_default_free_blocks_flags() in
fs/ext4/extents.c, and like S_ISDIR, we want to treat EA inode blocks
as metadata.   So I'm not sure I see the value of this change?

As an aside, I was looking at fs/ext4/mballoc.c to see what the
difference is for treating a block as a metadata block versus a
journaled data block, and what I found made my hair stand on end:

	/*
	 * We need to make sure we don't reuse the freed block until after the
	 * transaction is committed. We make an exception if the inode is to be
	 * written in writeback mode since writeback mode has weak data
	 * consistency guarantees.
	 */

So in data=writeback, if a file is deleted, its blocks are available
for immediate reallocation, and if we are under heavy memory pressure,
the deleted file's blocks could get overwritten --- even in the case
where we crash and the transaction never committed.

While it's true that data=writeback mode has weaker guarantees, my
understanding is that they only apply to the exposure of stale data, and
not to a long-standing file's blocks getting corrupted if it is almost
deleted, but not quite, before a crash.

Granted, the situation where this would happen is quite rare, but it
seems quite wrong....
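
The rule described by that comment can be modelled as a predicate.  This
is a sketch based only on the comment text quoted above; the enum, the
function name, and the parameters are hypothetical and are not copied
from fs/ext4/mballoc.c.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_data_mode { SKETCH_JOURNAL, SKETCH_ORDERED, SKETCH_WRITEBACK };

/*
 * Returns true when a freed block must stay unavailable until the
 * transaction commits, false when it may be reallocated immediately.
 * Metadata blocks are always deferred; data blocks are deferred unless
 * the inode is written in writeback mode.
 */
static bool sketch_must_defer_reuse(bool journal_active, bool is_metadata,
				    enum sketch_data_mode mode)
{
	assert(mode <= SKETCH_WRITEBACK);
	if (!journal_active)
		return false;
	return is_metadata || mode != SKETCH_WRITEBACK;
}
```

The case of concern is the one where the predicate returns false with a
journal active: a data block of a file in writeback mode, which may be
handed out again before the freeing transaction has committed.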

						- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend()
  2019-10-03 22:05 ` [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
@ 2019-10-21  1:39   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21  1:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:52AM +0200, Jan Kara wrote:
> Use ext4 helper ext4_journal_extend() instead of opencoding it in
> ext4_try_to_expand_extra_isize().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good; you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch()
  2019-10-03 22:05 ` [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
@ 2019-10-21 13:39   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 13:39 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:53AM +0200, Jan Kara wrote:
> Error cleanup path in ext4_alloc_branch() calls ext4_forget() on freshly
> allocated indirect blocks with 'metadata' set to 1. This results in
> generating revoke records for these blocks. However this is unnecessary
> as the freed blocks are only allocated in the current transaction and
> thus they will never be journalled. Make this cleanup path similar to
> e.g. cleanup in ext4_splice_branch() and use ext4_free_blocks() to
> handle block forgetting by passing EXT4_FREE_BLOCKS_FORGET and not
> EXT4_FREE_BLOCKS_METADATA to ext4_free_blocks(). This also allows
> allocating transaction not to reserve any credits for revoke records.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 08/22] ext4: Provide function to handle transaction restarts
  2019-10-03 22:05 ` [PATCH 08/22] ext4: Provide function to handle transaction restarts Jan Kara
@ 2019-10-21 16:20   ` Theodore Y. Ts'o
  2019-10-23 16:25     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 16:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:54AM +0200, Jan Kara wrote:
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index fb0f99dc8c22..32f2c22c7ef2 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c

> +/*
> + * Make sure 'handle' has at least 'check_cred' credits. If not, restart
> + * transaction with 'restart_cred' credits. The function drops i_data_sem
> + * when restarting transaction and gets it after transaction is restarted.
> + *
> + * The function returns 0 on success, 1 if transaction had to be restarted,
> + * and < 0 in case of fatal error.
> + */
> +int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
> +				int check_cred, int restart_cred)

This makes me super nervous.  This gets called by ext4_access_path(),
which in turn is called by insert_range and collapse_range (among
others), where we previously were not dropping i_data_sem.  This means
we will be dropping i_data_sem while they are in the middle of doing
surgery to the extent tree.

Granted, insert_range and collapse_range take a lot of locks,
including the inode lock, but it's not obvious to me that this is
safe, and at the very least the documentation for ext4_access_path
should have a warning note in its comments that i_data_sem can get
dropped, and its call sites audited if they haven't already.
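
To make the return convention in the quoted comment concrete, here is a
user-space sketch.  Everything in it is hypothetical (the handle struct,
the function names, the credit numbers); it only models "0 on success, 1
if restarted, < 0 on error" and the point that a restart means the lock
was released in between.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a handle plus a counter of lock drops. */
struct sketch_handle {
	int credits;        /* credits currently available */
	int data_sem_drops; /* times the lock was dropped and retaken */
};

/* 0: enough credits; 1: restarted (lock was dropped); < 0: error. */
static int sketch_ensure_credits(struct sketch_handle *h, int check_cred,
				 int restart_cred)
{
	assert(h != NULL);
	if (check_cred < 0 || restart_cred < 0)
		return -EINVAL;
	if (h->credits >= check_cred)
		return 0;
	h->data_sem_drops++;       /* lock released across the restart */
	h->credits = restart_cred; /* fresh transaction, fresh credits */
	return 1;
}

/* Caller: only negative values are errors; 1 means "revalidate state". */
static int sketch_caller(struct sketch_handle *h, bool *must_revalidate)
{
	int ret = sketch_ensure_credits(h, 8, 32);

	if (ret < 0)
		return ret;
	*must_revalidate = (ret > 0);
	return 0;
}
```

A caller that treats any non-zero return as failure would misread the
restart case, and a caller that ignores the value 1 would keep using
state that may have changed while the lock was not held.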

Thanks,

							- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits
  2019-10-03 22:05 ` [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits Jan Kara
@ 2019-10-21 16:21   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 16:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:55AM +0200, Jan Kara wrote:
> Provide accessor function to get number of credits available in a handle
> and use it from ext4. Later, computation of available credits won't be
> so straightforward.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits
  2019-10-03 22:05 ` [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits Jan Kara
@ 2019-10-21 16:21   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 16:21 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:56AM +0200, Jan Kara wrote:
> Use the jbd2 accessor function for h_buffer_credits.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks
  2019-10-03 22:05 ` [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks Jan Kara
@ 2019-10-21 16:24   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 16:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:57AM +0200, Jan Kara wrote:
> jbd2 statistics counting number of blocks logged in a transaction was
> wrong. It didn't count the commit block and more importantly it didn't
> count revoke descriptor blocks. Make sure these get properly counted.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good, you can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop()
  2019-10-03 22:05 ` [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop() Jan Kara
@ 2019-10-21 17:29   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 17:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:58AM +0200, Jan Kara wrote:
> Move code in jbd2_journal_stop() around a bit. It removes some
> unnecessary code duplication and will make factoring out parts common
> with jbd2__journal_restart() easier.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good.  You can add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

Thanks!

					- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop()
  2019-10-03 22:05 ` [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
@ 2019-10-21 17:30   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 17:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:05:59AM +0200, Jan Kara wrote:
> If a transaction is larger than journal->j_max_transaction_buffers, that
> is a bug and not a trigger for transaction commit. Also the very next
> attempt to start new handle will start transaction commit anyway. So
> just remove the pointless check. Arguably, we could start transaction
> commit whenever the transaction size is *close* to
> journal->j_max_transaction_buffers. This has a potential to reduce
> latency of the next jbd2_journal_start() at the cost of somewhat smaller
> transactions. However for this to have any effect, it would mean that
> there isn't someone already waiting in jbd2_journal_start() which means
> metadata load for the fs is pretty light anyway so probably this
> optimization is not worth it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good; feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 14/22] jbd2: Drop pointless wakeup from jbd2_journal_stop()
  2019-10-03 22:06 ` [PATCH 14/22] jbd2: Drop pointless wakeup " Jan Kara
@ 2019-10-21 17:34   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 17:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:00AM +0200, Jan Kara wrote:
> When we drop last handle from a transaction and journal->j_barrier_count
> > 0, jbd2_journal_stop() wakes up journal->j_wait_transaction_locked
> wait queue. This looks pointless - wait for outstanding handles always
> happens on journal->j_wait_updates waitqueue.
> journal->j_wait_transaction_locked is used to wait for transaction state
> changes and by start_this_handle() for waiting until
> journal->j_barrier_count drops to 0. The first case is clearly
> irrelevant here since only jbd2 thread changes transaction state. The
> second case looks related but jbd2_journal_unlock_updates() is
> responsible for the wakeup in this case. So just drop the wakeup.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good; feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle
  2019-10-03 22:06 ` [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
@ 2019-10-21 17:49   ` Theodore Y. Ts'o
  2019-10-23 16:17     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 17:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:01AM +0200, Jan Kara wrote:
> jbd2__journal_restart() has quite some code that is common with
> jbd2_journal_stop(). Factor this functionality into stop_this_handle()
> helper and use it from both functions. Note that this also drops
> t_handle_lock protection from jbd2__journal_restart() as
> jbd2_journal_stop() does the same thing without it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/jbd2/transaction.c | 94 +++++++++++++++++++++++----------------------------
>  1 file changed, 42 insertions(+), 52 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index d648cec3f90f..d4ee02e5161b 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -677,52 +704,30 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)

> -	read_lock(&journal->j_state_lock);
> -	spin_lock(&transaction->t_handle_lock);
> -	atomic_sub(handle->h_buffer_credits,
> -		   &transaction->t_outstanding_credits);
> -	if (handle->h_rsv_handle) {
> -		sub_reserved_credits(journal,
> -				     handle->h_rsv_handle->h_buffer_credits);
> -	}
> -	if (atomic_dec_and_test(&transaction->t_updates))
> -		wake_up(&journal->j_wait_updates);
> -	tid = transaction->t_tid;
> -	spin_unlock(&transaction->t_handle_lock);
> +	jbd_debug(2, "restarting handle %p\n", handle);
> +	stop_this_handle(handle);
>  	handle->h_transaction = NULL;
> -	current->journal_info = NULL;
>  
> -	jbd_debug(2, "restarting handle %p\n", handle);
> +	read_lock(&journal->j_state_lock);
>  	need_to_start = !tid_geq(journal->j_commit_request, tid);
>  	read_unlock(&journal->j_state_lock);

What is j_state_lock protecting at this point?  There's only a 32-bit
read of j_commit_request at this point.
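
For reference, the comparison done under the lock is a wraparound-safe
ordering test on 32-bit transaction IDs.  The sketch below re-creates
that idea with hypothetical names in user space; it relies on the usual
two's-complement conversion from unsigned to int, which is
implementation-defined in ISO C.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int sketch_tid_t;

static_assert(sizeof(sketch_tid_t) == 4, "sketch assumes 32-bit tids");

/* True if x is at or after y in sequence order, allowing wraparound. */
static bool sketch_tid_geq(sketch_tid_t x, sketch_tid_t y)
{
	int difference = (int)(x - y);

	return difference >= 0;
}

/* A commit must still be requested if none covers 'tid' yet. */
static bool sketch_need_to_start(sketch_tid_t commit_request,
				 sketch_tid_t tid)
{
	return !sketch_tid_geq(commit_request, tid);
}
```

The whole decision depends on a single 32-bit value read from the
journal, which is what the question above is about.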

						- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits
  2019-10-03 22:06 ` [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
@ 2019-10-21 21:04   ` Theodore Y. Ts'o
  2019-10-23 13:09     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 21:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:02AM +0200, Jan Kara wrote:
> Currently, journal descriptor blocks were not accounted in
> transaction->t_outstanding_credits and we were just leaving some slack
> space in the journal for them (in jbd2_log_space_left() and
> jbd2_space_needed()). This is making proper accounting (and reservation
> we want to add) of descriptor blocks difficult so switch to accounting
> descriptor blocks in transaction->t_outstanding_credits and just reserve
> the same amount of credits in t_outstanding_credits for journal
> descriptor blocks when creating transaction.

This changes the meaning of t_outstanding_credits; in particular the
documentation of t_outstanding_credits in include/linux/jbd2.h is no
longer correct, as it currently defines it as containing:

     Number of buffers reserved for use by all handles in this transaction
     handle but not yet modified. [none]

Previously, t_outstanding_credits would go to zero once all of the
handles attached to the transaction were closed.  Now, it is
initialized to j_max_transaction_buffers >> 5, and once all of the
handles are closed t_outstanding_credits will go back to that value.
It then gets decremented as we write each jbd descriptor block
(whether it is for a revoke block or a data block) during the commit
and we throw a warning if we ever write more than j_max_transaction_buffers >> 5
descriptor blocks.

Is that a fair summary of what happens after this commit?
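
The lifecycle summarized above can be simulated in a few lines.  This is
a sketch with hypothetical names; it assumes the descriptor reserve is
j_max_transaction_buffers >> 5 (the estimate mentioned in the cover
letter) and, like the summary, it ignores credits that stay consumed by
modified buffers.

```c
#include <assert.h>

#define SKETCH_CONTROL_BLOCKS_SHIFT 5 /* assumed shift for the reserve */

struct sketch_txn {
	int max_buffers;         /* models j_max_transaction_buffers */
	int outstanding_credits; /* models t_outstanding_credits */
};

/* A new transaction starts with the descriptor-block reserve. */
static void sketch_txn_init(struct sketch_txn *t, int max_buffers)
{
	t->max_buffers = max_buffers;
	t->outstanding_credits = max_buffers >> SKETCH_CONTROL_BLOCKS_SHIFT;
}

/* Starting a handle adds its credits to the transaction. */
static void sketch_handle_start(struct sketch_txn *t, int credits)
{
	t->outstanding_credits += credits;
}

/* Stopping a handle gives back the credits it did not use. */
static void sketch_handle_stop(struct sketch_txn *t, int unused)
{
	t->outstanding_credits -= unused;
}

/* Commit writes one descriptor block; returns 1 where a warning fires. */
static int sketch_write_descriptor(struct sketch_txn *t)
{
	t->outstanding_credits--;
	return t->outstanding_credits < 0;
}
```

With max_buffers = 8192 the reserve is 256: the counter returns to 256
after all handles close, and the 257th descriptor write is the first one
that would trigger the warning.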

The thing is, I don't see how this helps the rest of the patch series;
we account for space needed for the revoke blocks in later patches,
but I don't see that adjusting t_outstanding_credits.  We reserve
extra space for the revoke blocks, and we then account for that space,
but the fact that we have accounted for all of the extra descriptor
blocks in t_outstanding_credits doesn't seem to be changed.  As a
result, we appear to be double-counting the space needed for the
revoke descriptor blocks.  Which is fine; I don't mind the accounting
being a bit more conservative, but I find myself being a bit puzzled
about why this change is necessary or adds value.

What am I missing?

						- Ted

^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 17/22] jbd2: Drop jbd2_space_needed()
  2019-10-03 22:06 ` [PATCH 17/22] jbd2: Drop jbd2_space_needed() Jan Kara
@ 2019-10-21 21:05   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 21:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:03AM +0200, Jan Kara wrote:
> The function is now just a trivial wrapper returning
> journal->j_max_transaction_buffers. Drop it.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Assuming that we still need to do patch #16 (see my previous review
about my questions about what value it adds), this makes sense.

Feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>


^ permalink raw reply	[flat|nested] 101+ messages in thread

* Re: [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks
  2019-10-03 22:06 ` [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks Jan Kara
@ 2019-10-21 21:47   ` Theodore Y. Ts'o
  2019-10-23 13:27     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 21:47 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:04AM +0200, Jan Kara wrote:
> Extend functions for starting, extending, and restarting transaction
> handles to take number of revoke records handle must be able to
> accommodate. These functions then make sure transaction has enough
> credits to be able to store resulting revoke descriptor blocks. Also
> revoke code tracks number of revoke records created by a handle to catch
> situation where some place didn't reserve enough space for revoke
> records. Similarly to standard transaction credits, space for unused
> reserved revoke records is released when the handle is stopped.
> 
> On the ext4 side we currently take a simplistic approach of reserving
> space for 1024 revoke records for any transaction. This grows amount of
> credits reserved for each handle only by a few and is enough for any
> normal workload so that we don't hit warnings in jbd2. We will refine
> the logic in following commits.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

So let me summarize the way I think this commit is handling things.

1) When a handle is created, the caller specifies how many revokes it
plans to do.  If during the life of the handle, more than this number
of revokes are done, a warning will be emitted.

2) For the purposes of reserving transaction credits, when we start
the handle we assume the worst case number of revoke
descriptors necessary, and we reserve that much space, and we add it
to t_outstanding_credits.

3) When we stop the handle, we decrement t_outstanding_credits by the
number of blocks that were originally reserved for this handle --- but
*not* the number of worst case revoke descriptor blocks needed.  Which
means that after the handle is started and then closed,
t_outstanding_credits will be increased by ROUND_UP((max # of revoked
blocks) / (# of revoke records per revoke descriptor block)).

If we delete a large number of files which are but a single 4k block
in data=journal mode, each deleted file will increase
t_outstanding_credits by one block, even though we won't be using
anywhere *near* that number of blocks for revoke blocks.  So we will
end up closing the transactions *much* earlier than we would have.
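
To make the rounding effect above concrete, here is a minimal user-space
sketch. It is not code from the patch series; the function names are
hypothetical, and the number of revoke records that fit in one descriptor
block is passed in as a parameter rather than derived from the block size
and journal features.

```c
#include <assert.h>

/* Ceiling division for positive integers. */
static unsigned int div_round_up(unsigned int n, unsigned int d)
{
	assert(d > 0);
	return (n + d - 1) / d;
}

/*
 * Descriptor blocks reserved when every handle rounds up on its own:
 * each of 'handles' handles reserves room for 'revokes_per_handle' records.
 */
static unsigned int reserved_per_handle(unsigned int handles,
					unsigned int revokes_per_handle,
					unsigned int records_per_block)
{
	return handles * div_round_up(revokes_per_handle, records_per_block);
}

/* Descriptor blocks needed when revokes are counted per transaction. */
static unsigned int needed_per_transaction(unsigned int handles,
					   unsigned int revokes_per_handle,
					   unsigned int records_per_block)
{
	return div_round_up(handles * revokes_per_handle, records_per_block);
}
```

With an illustrative 500 records per block, 1000 handles that each revoke
one block reserve 1000 descriptor blocks under per-handle rounding, while
the revoke records themselves fit in 2.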

It also means that t_outstanding_credits will be a much higher number
than we would ever need, so it's not clear to me why it's worth it to
decrement t_outstanding_credits in jbd2_journal_get_descriptor_buffer()
and warn if it is less than zero.    And it goes back to the question
I had asked earlier: "so what is the formal definition of
t_outstanding_credits after this patch series, anyway"?

						- Ted


* Re: [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits
  2019-10-03 22:06 ` [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
@ 2019-10-21 21:48   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 21:48 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:05AM +0200, Jan Kara wrote:
> The credit counter now contains both buffer and revoke descriptor block
> credits. Rename the counter to h_total_credits to reflect that. No
> functional change.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks fine, feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>



* Re: [PATCH 20/22] jbd2: Make credit checking more strict
  2019-10-03 22:06 ` [PATCH 20/22] jbd2: Make credit checking more strict Jan Kara
@ 2019-10-21 22:29   ` Theodore Y. Ts'o
  2019-10-23 13:30     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 22:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:06AM +0200, Jan Kara wrote:
> Make checking of available credits in jbd2_journal_dirty_metadata() more
> strict. There should be always enough credits in the handle to write all
> potential revoke descriptors. Also we warn in case there are not enough
> credits since this is a bug in the filesystem.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

This is fine, but I wonder if we should also be returning an error in
jbd2_journal_revoke() --- of course, one problem is ext4_forget() is
getting called from ext4_free_blocks(), which currently doesn't return
an error.  But we can capture the error return in __ext4_forget(), and
at that point we can give a much more useful error message, since we
can print the function caller and line number.

Feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>



* Re: [PATCH 21/22] ext4: Reserve revoke credits for freed blocks
  2019-10-03 22:06 ` [PATCH 21/22] ext4: Reserve revoke credits for freed blocks Jan Kara
@ 2019-10-21 23:18   ` Theodore Y. Ts'o
  2019-10-23 16:13     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 23:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:07AM +0200, Jan Kara wrote:
> +static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
> +{
> +	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
> +		return 0;
> +	if (!ext4_should_journal_data(inode))
> +		return 0;
> +	/*
> +	 * Data blocks in one extent are contiguous, just account for partial
> +	 * clusters at extent boundaries
> +	 */
> +	return blocks + 2*EXT4_SB(inode->i_sb)->s_cluster_ratio;
> +}

This looks *way* too conservative.  At the very least, this should be:


	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);

Since when the cluster ratio is 1, there are no partial clusters at the
extent boundaries, and if bigalloc is enabled, and the cluster ratio
is 16, the worst case of "extra" blocks at the boundaries would be 15.

It would probably be better to push this up to the callers, since we
can get the exact number by calculating

	(EXT4_B2C(sbi, last) - EXT4_B2C(sbi, first) + 1) * sbi->s_cluster_ratio

This is a bit more complicated in fs/ext4/indirect.c, where we
probably will need to do a min of these two formulas.
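
A small user-space sketch of the two estimates, for comparison. The
helper names are hypothetical, and the shift by cluster_bits stands in
for EXT4_B2C(); it assumes the cluster ratio is 1 << cluster_bits.

```c
#include <assert.h>

/* Upper bound: freed blocks plus a partial cluster at each extent end. */
static unsigned long bound_estimate(unsigned long blocks,
				    unsigned int cluster_bits)
{
	unsigned long ratio = 1UL << cluster_bits;

	return blocks + 2 * (ratio - 1);
}

/* Exact count: blocks in all clusters touched by [first, last]. */
static unsigned long exact_estimate(unsigned long first, unsigned long last,
				    unsigned int cluster_bits)
{
	assert(first <= last);
	/* block-to-cluster conversion, standing in for EXT4_B2C() */
	return ((last >> cluster_bits) - (first >> cluster_bits) + 1)
		<< cluster_bits;
}
```

For a cluster ratio of 16, freeing blocks 15..16 touches two clusters, so
both estimates give 32. Freeing the aligned range 16..31 gives an exact
count of 16 against a bound of 46, which is where pushing the calculation
up to the callers would pay off.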



The other thing which I wonder, looking at these, is whether it's
worth it to add a new revoke table format which uses 8 or 12 bytes,
where there is a block number followed by a 32-bit count field (e.g.,
a revoke extent).

I actually suspect that if we made the format change, with the revoke
code using the revoke extent table if (a) a new journal feature flag
allows it, and (b) using the revoke extent table would be beneficial,
in the vast majority of cases, that might have addressed the problem
that you saw without having to do the strict tracking of revoke
blocks.  Of course, I'm sure it's still possible to create a worst
case file system and workload where the revoke blocks could still
overflow the journal --- but it would probably be very hard to do and
would only show up in a malicious workload.

What do you think?

					- Ted


* Re: [PATCH 22/22] jbd2: Provide trace event for handle restarts
  2019-10-03 22:06 ` [PATCH 22/22] jbd2: Provide trace event for handle restarts Jan Kara
@ 2019-10-21 23:18   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-21 23:18 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Fri, Oct 04, 2019 at 12:06:08AM +0200, Jan Kara wrote:
> Provide trace event for handle restarts to ease debugging.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good; feel free to add:

Reviewed-by: Theodore Ts'o <tytso@mit.edu>



* Re: [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits
  2019-10-21 21:04   ` Theodore Y. Ts'o
@ 2019-10-23 13:09     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-23 13:09 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 17:04:20, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:06:02AM +0200, Jan Kara wrote:
> > Currently, journal descriptor blocks were not accounted in
> > transaction->t_outstanding_credits and we were just leaving some slack
> > space in the journal for them (in jbd2_log_space_left() and
> > jbd2_space_needed()). This is making proper accounting (and reservation
> > we want to add) of descriptor blocks difficult so switch to accounting
> > descriptor blocks in transaction->t_outstanding_credits and just reserve
> > the same amount of credits in t_outstanding_credits for journal
> > descriptor blocks when creating transaction.
> 
> This changes the meaning of t_outstanding_credits; in particular the
> documentation of t_outstanding_credits in include/linux/jbd2.h is no
> longer correct, as it currently defines it as containing:
> 
>      Number of buffers reserved for use by all handles in this transaction
>      handle but not yet modified. [none]

Right, I can improve the description to better match the new meaning.

> Previously, t_outstanding_credits would go to zero once all of the
> handles attached to the transaction were closed.  Now, it is
> initialized to j_max_transaction_buffers >> 32, and once all of the
> handles are closed t_outstanding_credits will go back to that value.
> It then gets decremented as we write each jbd descriptor block
> (whether it is for a revoke block or a data block) during the commit
> and we throw a warning if we ever write more than j_max_transaction_buffers >> 32
> descriptor blocks.
> 
> Is that a fair summary of what happens after this commit?

Yes, that's correct.
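
The lifecycle summarized in the quoted text can be modelled roughly as
follows. This is a simplified user-space sketch with hypothetical names,
not the jbd2 implementation: the descriptor reserve is passed in as a
plain number, and credits consumed by modified buffers (which stay
accounted to the transaction) are not modelled.

```c
#include <assert.h>

struct transaction_sketch {
	int outstanding_credits;	/* models t_outstanding_credits */
	int warnings;			/* times the sanity check fired */
};

/* A new transaction starts with the descriptor block reserve. */
static void transaction_init(struct transaction_sketch *t,
			     int descriptor_reserve)
{
	assert(descriptor_reserve >= 0);
	t->outstanding_credits = descriptor_reserve;
	t->warnings = 0;
}

/* Starting a handle adds its credits to the transaction. */
static void handle_start(struct transaction_sketch *t, int credits)
{
	t->outstanding_credits += credits;
}

/* Unused credits go back to the journal when the handle stops. */
static void handle_stop(struct transaction_sketch *t, int unused_credits)
{
	t->outstanding_credits -= unused_credits;
}

/* Commit consumes one credit per descriptor block it writes. */
static void write_descriptor_block(struct transaction_sketch *t)
{
	if (--t->outstanding_credits < 0)
		t->warnings++;
}
```

With a reserve of 2, a handle that starts with 10 credits and returns all
10 leaves the counter at 2; the third descriptor block written during
commit then trips the warning.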

> The thing is, I don't see how this helps the rest of the patch series;
> we account for space needed for the revoke blocks in later patches,
> but I don't see that adjusting t_outstanding credits.  We reserve
> extra space for the revoke blocks, and we then account for that space,
> but the fact that we have accounted for all of the extra descriptor
> blocks in t_outstanding_credits doesn't seem to be changed.  As a
> result, we appear to be double-counting the space needed for the
> revoke descriptor blocks.  Which is fine; I don't mind the accounting
> being a bit more conservative, but I find myself being a bit puzzled
> about why this change is necessary or adds value.

As you properly mentioned above the new meaning of t_outstanding_credits
is meant to be "the amount of space reserved for the transaction in the
journal". This is including all the descriptor blocks the transaction may
need. And it seemed easier to me to change t_outstanding_credits to this
new meaning because later the amount of space reserved for descriptor
blocks stops being constant so we would have to change several places to
use "t_outstanding_credits + t_descriptor_credits" instead which gets
especially tricky in cases where we manipulate t_outstanding_credits
atomically (start_this_handle(), add_transaction_credits() in particular).

WRT double-accounting of credits reserved for descriptor blocks: Yes,
revoke descriptor blocks get accounted separately later in the series and
my plan was to shrink the estimate in jbd2_descriptor_blocks_per_trans() at
the end of the series to match the fact that now we need to account only
for commit block and other control blocks which are much more limited.
Which I forgot about in the end. So I will add a patch to do that now.
Thanks for the review!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks
  2019-10-21 21:47   ` Theodore Y. Ts'o
@ 2019-10-23 13:27     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-23 13:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 17:47:54, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:06:04AM +0200, Jan Kara wrote:
> > Extend functions for starting, extending, and restarting transaction
> > handles to take number of revoke records handle must be able to
> > accommodate. These functions then make sure transaction has enough
> > credits to be able to store resulting revoke descriptor blocks. Also
> > revoke code tracks number of revoke records created by a handle to catch
> > situation where some place didn't reserve enough space for revoke
> > records. Similarly to standard transaction credits, space for unused
> > reserved revoke records is released when the handle is stopped.
> > 
> > On the ext4 side we currently take a simplistic approach of reserving
> > space for 1024 revoke records for any transaction. This grows amount of
> > credits reserved for each handle only by a few and is enough for any
> > normal workload so that we don't hit warnings in jbd2. We will refine
> > the logic in following commits.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> So let me summarize the way I think this commit is handling things.
> 
> 1) When a handle is created, the caller specifies how many revokes it
> plans to do.  If during the life of the handle, more than this number
> of revokes are done, a warning will be emitted.

Correct.

> 2) For the purposes of reserving transaction credits, when we start
> the handle we assume the worst case number of revoke
> descriptors necessary, and we reserve that much space, and we add it
> to t_outstanding_credits.

Again correct.

> 3) When we stop the handle, we decrement t_outstanding_credits by the
> number of blocks that were originally reserved for this handle --- but
> *not* the number of worst case revoke descriptor blocks needed.  Which
> means that after the handle is started and then closed,
> t_outstanding_credits will be increased by ROUND_UP((max # of revoked
> blocks) / (# of revoke records per revoke descriptor block)).
> 
> If we delete a large number of files which are but a single 4k block
> in data=journal mode, each deleted file will increase
> t_outstanding_credits by one block, even though we won't be using
> anywhere *near* that number of blocks for revoke blocks.  So we will
> end up closing the transactions *much* earlier than we would have.

Right. Any handle that revokes at least one block will reserve at least one
block for a revoke descriptor. I agree that will heavily overestimate the
number of necessary revoke blocks in some cases. If you think that's
problematic, I can refine the logic so that rounding errors don't
accumulate that much (probably by tracking the exact number of revokes in
the transaction).

> It also means that t_outstanding_credits will be a much higher number
> than we would ever need, so it's not clear to me why it's worth it to
> decrement t_outstanding_credits in jbd2_journal_get_descriptor_buffer()
> and warn if it is less than zero. 

Well, that tracking is a sanity check that we did reserve enough descriptor
blocks for each transaction.

> And it goes back to the question I had asked earlier: "so what is the
> formal definition of t_outstanding_credits after this patch series,
> anyway"?

That should be answered in my previous answer.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 20/22] jbd2: Make credit checking more strict
  2019-10-21 22:29   ` Theodore Y. Ts'o
@ 2019-10-23 13:30     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-23 13:30 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 18:29:59, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:06:06AM +0200, Jan Kara wrote:
> > Make checking of available credits in jbd2_journal_dirty_metadata() more
> > strict. There should be always enough credits in the handle to write all
> > potential revoke descriptors. Also we warn in case there are not enough
> > credits since this is a bug in the filesystem.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> This is fine, but I wonder if we should also be returning an error in
> jbd2_journal_revoke() --- of course, one problem is ext4_forget() is
> getting called from ext4_free_blocks(), which currently doesn't return
> an error.  But we can capture the error return in __ext4_forget(), and
> at that point we can give a much more useful error message, since we
> can print the function caller and line number.

Yeah, that's a good point. I'll add a sanity check to jbd2_journal_revoke()
and then generate some error message in ext4.

> Feel free to add:
> 
> Reviewed-by: Theodore Ts'o <tytso@mit.edu>

Thanks!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 21/22] ext4: Reserve revoke credits for freed blocks
  2019-10-21 23:18   ` Theodore Y. Ts'o
@ 2019-10-23 16:13     ` Jan Kara
  2019-11-04 13:08       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-23 16:13 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 19:18:18, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:06:07AM +0200, Jan Kara wrote:
> > +static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
> > +{
> > +	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
> > +		return 0;
> > +	if (!ext4_should_journal_data(inode))
> > +		return 0;
> > +	/*
> > +	 * Data blocks in one extent are contiguous, just account for partial
> > +	 * clusters at extent boundaries
> > +	 */
> > +	return blocks + 2*EXT4_SB(inode->i_sb)->s_cluster_ratio;
> > +}
> 
> This looks *way* too conservative.  At the very least, this should be:
> 
> 
> 	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
> 
> Since when the cluster ratio is 1, there are no partial clusters at the
> extent boundaries, and if bigalloc is enabled, and the cluster ratio
> is 16, the worst case of "extra" blocks at the boundaries would be 15.

OK, I will update that. Thanks for the correction.

> It would probably be better to push this up to the callers, since we
> can get the exact number by calculating
> 
> 	(EXT4_B2C(sbi, last) - EXT4_B2C(sbi, first) + 1) * sbi->s_cluster_ratio
> 
> This is a bit more complicated in fs/ext4/indirect.c, where we
> probably will need to do a min of these two formulas.

Is it worth the complexity at the callers? If we don't use some reserved
revoke credits, we'll just return them back. And the truncate code
generally works one extent at a time so in the end we may have just asked
for 1 more descriptor block than strictly necessary while the handle is
running...

> The other thing which I wonder, looking at these, is whether it's
> worth it to add a new revoke table format which uses 8 or 12 bytes,
> where there is a block number followed by a 32-bit count field (e.g.,
> a revoke extent).
> 
> I actually suspect that if we made the format change, with the revoke
> code using the revoke extent table if (a) a new journal feature flag
> allows it, and (b) using the revoke extent table would be beneficial,
> in the vast majority of cases, that might have addressed the problem
> that you saw without having to do the strict tracking of revoke
> blocks.  Of course, I'm sure it's still possible to create a worst
> case file system and workload where the revoke blocks could still
> overflow the journal --- but it would probably be very hard to do and
> would only show up in a malicious workload.
> 
> What do you think?

Yes, I was thinking about the same. Extent format of revoke blocks would
certainly reduce the number of revoke descriptor blocks in the average
case. On the other hand I think that especially large directories can be
pretty fragmented so it isn't clear how big the average win would be. And
as you say the worst case estimate would not really change substantially
with the different format so to make the filesystem resistant to a malicious
attacker we need some form of reservation of revoke descriptor blocks
anyway. So in the end I've decided to go without on-disk format change for
now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle
  2019-10-21 17:49   ` Theodore Y. Ts'o
@ 2019-10-23 16:17     ` Jan Kara
  2019-11-04 12:36       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-23 16:17 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 13:49:33, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:06:01AM +0200, Jan Kara wrote:
> > jbd2__journal_restart() has quite some code that is common with
> > jbd2_journal_stop(). Factor this functionality into stop_this_handle()
> > helper and use it from both functions. Note that this also drops
> > t_handle_lock protection from jbd2__journal_restart() as
> > jbd2_journal_stop() does the same thing without it.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/jbd2/transaction.c | 94 +++++++++++++++++++++++----------------------------
> >  1 file changed, 42 insertions(+), 52 deletions(-)
> > 
> > diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> > index d648cec3f90f..d4ee02e5161b 100644
> > --- a/fs/jbd2/transaction.c
> > +++ b/fs/jbd2/transaction.c
> > @@ -677,52 +704,30 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
> 
> > -	read_lock(&journal->j_state_lock);
> > -	spin_lock(&transaction->t_handle_lock);
> > -	atomic_sub(handle->h_buffer_credits,
> > -		   &transaction->t_outstanding_credits);
> > -	if (handle->h_rsv_handle) {
> > -		sub_reserved_credits(journal,
> > -				     handle->h_rsv_handle->h_buffer_credits);
> > -	}
> > -	if (atomic_dec_and_test(&transaction->t_updates))
> > -		wake_up(&journal->j_wait_updates);
> > -	tid = transaction->t_tid;
> > -	spin_unlock(&transaction->t_handle_lock);
> > +	jbd_debug(2, "restarting handle %p\n", handle);
> > +	stop_this_handle(handle);
> >  	handle->h_transaction = NULL;
> > -	current->journal_info = NULL;
> >  
> > -	jbd_debug(2, "restarting handle %p\n", handle);
> > +	read_lock(&journal->j_state_lock);
> >  	need_to_start = !tid_geq(journal->j_commit_request, tid);
> >  	read_unlock(&journal->j_state_lock);
> 
> What is j_state_lock protecting at this point?  There's only a 32-bit
> read of j_commit_request at this point.

We could almost drop the lock. To be fully correct, we'd then need to use
READ_ONCE here and WRITE_ONCE in places changing j_commit_request (reasons
are well summarized in the recent LWN series on how the compiler can screw
up your unlocked reads and writes). So probably a fair cleanup but something
I've decided to leave for later.
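
For reference, a user-space sketch of what the lockless variant could
look like. The macros below are simplified stand-ins for the kernel's
READ_ONCE()/WRITE_ONCE(), tid_geq() is modelled on the jbd2 helper of the
same name, and struct journal_sketch and the two functions around it are
hypothetical.

```c
#include <assert.h>

typedef unsigned int tid_t;

/* User-space stand-ins for the kernel's READ_ONCE()/WRITE_ONCE(). */
#define READ_ONCE_TID(x)	(*(const volatile tid_t *)&(x))
#define WRITE_ONCE_TID(x, v)	(*(volatile tid_t *)&(x) = (v))

/* Wraparound-safe "x >= y" for transaction IDs, modelled on tid_geq(). */
static int tid_geq(tid_t x, tid_t y)
{
	int difference = (int)(x - y);

	return difference >= 0;
}

struct journal_sketch {
	tid_t commit_request;	/* stands in for j_commit_request */
};

/* Writer side: every store to the field goes through WRITE_ONCE. */
static void request_commit(struct journal_sketch *j, tid_t tid)
{
	assert(j);
	WRITE_ONCE_TID(j->commit_request, tid);
}

/* Reader side: a single, untorn load without taking j_state_lock. */
static int need_to_start(struct journal_sketch *j, tid_t tid)
{
	assert(j);
	return !tid_geq(READ_ONCE_TID(j->commit_request), tid);
}
```

The volatile accesses only keep the compiler from tearing, fusing or
re-reading the load and store; they do not add any ordering against other
fields, which is fine here because only one 32-bit value is read.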

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 08/22] ext4: Provide function to handle transaction restarts
  2019-10-21 16:20   ` Theodore Y. Ts'o
@ 2019-10-23 16:25     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-23 16:25 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 21-10-19 12:20:46, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:05:54AM +0200, Jan Kara wrote:
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index fb0f99dc8c22..32f2c22c7ef2 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> 
> > +/*
> > + * Make sure 'handle' has at least 'check_cred' credits. If not, restart
> > + * transaction with 'restart_cred' credits. The function drops i_data_sem
> > + * when restarting transaction and gets it after transaction is restarted.
> > + *
> > + * The function returns 0 on success, 1 if transaction had to be restarted,
> > + * and < 0 in case of fatal error.
> > + */
> > +int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
> > +				int check_cred, int restart_cred)
> 
> This makes me super nervous.  This gets called by ext4_access_path(),
> which in turn is called by the insert_range, and collapse_range (among
> others) where we previously were not dropping i_data_sem.  This means
> we will be dropping i_data_sem while they are in the middle of doing
> surgery to the extent tree, which makes me super nervous.

But this patch changes nothing in that regard. Previously,
ext4_access_path() was using ext4_ext_truncate_extend_restart() which
called ext4_truncate_restart_trans() which was dropping i_data_sem as well.

> Granted, insert_range and collapse_range take a lot of locks,
> including the inode lock, but it's not obvious to me that this is
> safe, and at the very least the documentation for ext4_access_path
> should have a warning note in its comments that i_data_sem can get
> dropped, and its call sites audited if they haven't already.

Yeah, comment about that would be nice. I can add that when touching this
function anyway.
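
To illustrate why the comment matters for callers, here is a user-space
sketch of the return convention quoted above (0 on success, 1 if the
transaction was restarted, < 0 on fatal error). All names are
hypothetical stand-ins, not the ext4 functions; the point is only that a
positive return is not an error but does mean state cached under
i_data_sem may be stale.

```c
#include <assert.h>
#include <errno.h>

struct handle_sketch {
	int credits;	/* credits still available to the handle */
	int broken;	/* non-zero simulates a failed restart */
};

/*
 * Hypothetical stand-in following the quoted return convention:
 * 0 if enough credits, 1 if the handle was restarted, < 0 on error.
 */
static int ensure_credits_sketch(struct handle_sketch *h, int check_cred,
				 int restart_cred)
{
	assert(restart_cred >= check_cred);
	if (h->credits >= check_cred)
		return 0;
	if (h->broken)
		return -EIO;
	h->credits = restart_cred;	/* lock dropped and retaken here */
	return 1;
}

/* Returns < 0 on error, else the number of times state was re-validated. */
static int caller_sketch(struct handle_sketch *h, int check_cred,
			 int restart_cred)
{
	int revalidated = 0;
	int err = ensure_credits_sketch(h, check_cred, restart_cred);

	if (err < 0)		/* only negative values are errors */
		return err;
	if (err > 0)		/* restarted: cached tree state may be stale */
		revalidated++;
	return revalidated;
}
```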

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes
  2019-10-21  1:38   ` Theodore Y. Ts'o
@ 2019-10-23 16:55     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-23 16:55 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Sun 20-10-19 21:38:42, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:05:51AM +0200, Jan Kara wrote:
> > Similarly to directories, EA inodes do only journalled modifications to
> > their data. Change ext4_should_journal_data() to return true for them so
> > that we don't have to special-case them during truncate.
> 
> We are already special-casing EA inodes in ext4_clear_blocks() in
> fs/ext4/indirect.c, and get_default_free_blocks_flags() in
> fs/ext4/extents.c, and like S_ISDIR, we want to treat EA inode blocks
> as metadata.   So I'm not sure I see the value of this change?

Firstly, ext4_should_journal_data() should tell whether inode's data blocks
are modified through journalling. So as a principle of least surprise it
should return true for EA inodes because that's how data blocks of those
inodes are modified.

Secondly, once ext4_should_journal_data() is fixed by this patch, I think
that we can just drop that special-casing from ext4_clear_blocks() and
get_default_free_blocks_flags() and just have there:

	if (ext4_should_journal_data(inode))
		flags |= EXT4_FREE_BLOCKS_FORGET;

> As an aside, I was looking at fs/ext4/mballoc.c to see what the
> difference is for treating a block as a metadata block versus a
> journaled data block, and what I found made my hair rise on end:
> 
> 	/*
> 	 * We need to make sure we don't reuse the freed block until after the
> 	 * transaction is committed. We make an exception if the inode is to be
> 	 * written in writeback mode since writeback mode has weak data
> 	 * consistency guarantees.
> 	 */
> 
> So in data=writeback, if a file is deleted, its blocks are available
> for immediate reallocation, and if we are under heavy memory pressure,
> the deleted file's blocks could get overwritten --- even in the case
> where we crash and the transaction never committed.
> 
> While it's true that data=writeback mode has weaker guarantees, my
> understanding is that it only applied to the exposure of stale data, and
> not to a long-standing file's blocks getting corrupted if it is almost
> deleted, but not quite before a crash.
> 
> Granted, the situation where this would happen is quite rare, but it
> seems quite wrong....

I've always considered data=writeback as: You don't know what the data is
going to be if the file was touched shortly before crashing (i.e., similar
to old ext2 non-guarantees).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-21  1:21   ` Theodore Y. Ts'o
@ 2019-10-24 10:19     ` Jan Kara
  2019-10-24 12:09       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-24 10:19 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4, stable

On Sun 20-10-19 21:21:05, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:05:49AM +0200, Jan Kara wrote:
> > When ext4_mkdir() fails to add entry into directory, it ends up dropping
> > freshly created inode under the running transaction and thus inode
> > truncation happens under that transaction. That breaks assumptions that
> > ext4_evict_inode() does not get called from a transaction context
> > (although I'm not aware of any real issue) and is completely
> > unnecessary. Just stop the transaction before dropping inode reference.
> > 
> > CC: stable@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> If we call ext4_journal_stop(handle) before calling iput(inode),
> there's a chance that we could crash with the inode with i_link_counts
> == 0, but we won't have yet called ext4_evict_inode() to mark the inode
> as free in the inode bitmap.  This would result in an inode leak.
> 
> Also, this isn't the only place where we can enter ext4_evict_inode()
> with an active handle; the same situation arises in ext4_add_nondir(),
> and for the same reason.
> 
> So I think the code is right as is.  Do you agree?

Correct on both points. Thanks for spotting this! Now I still don't think
that calling iput() with running transaction is good. It complicates
matters with revoke record reservation but it is also fragile for other
reasons - e.g. flush worker could find the allocated inode just before we
will call iput() on it, try to write it out, block on starting transaction
and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
or similar, that's why I say above it's just fragile, not an outright bug.

So I'd still prefer to do the iput() outside of transaction and we can
protect from leaking the inode in case of crash by adding it to orphan
list. I'll update the patch. Thanks for review!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 04/22] ext4: Fix credit estimate for final inode freeing
  2019-10-21  1:07   ` Theodore Y. Ts'o
@ 2019-10-24 10:30     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-24 10:30 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4, stable

On Sun 20-10-19 21:07:23, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:05:50AM +0200, Jan Kara wrote:
> > Estimate for the number of credits needed for final freeing of inode in
> > ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
> > orphan deletion, bitmap & group descriptor for inode freeing) and not
> > just 3.
> 
> The modification for the inode should already be included in the
> calculation for ext4_blocks_for_truncate(), no?  So we only need 3
> extra blocks (sb, inode bitmap, and bg descriptor for the inode).

Yes, but 'extra_credits' is also passed to ext4_xattr_delete_inode() and if
that needs to restart a transaction, it needs to reserve enough for inode
modification in that new transaction. This patch is actually a result of
assertion checks I was getting with more accurate transaction restart
handling implemented later in this series...

I agree we can actually subtract 3 from
ext4_blocks_for_truncate(inode)+extra_credits when starting the initial
transaction as inode changes get double-accounted there. I can do that
and I'll also update the changelog to explain this better.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-24 10:19     ` Jan Kara
@ 2019-10-24 12:09       ` Theodore Y. Ts'o
  2019-10-24 13:37         ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-24 12:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Thu, Oct 24, 2019 at 12:19:06PM +0200, Jan Kara wrote:
> Correct on both points. Thanks for spotting this! Now I still don't think
> that calling iput() with running transaction is good. It complicates
> matters with revoke record reservation but it is also fragile for other
> reasons - e.g. flush worker could find the allocated inode just before we
> will call iput() on it, try to write it out, block on starting transaction
> and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
> inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
> or similar, that's why I say above it's just fragile, not an outright bug.

But we don't ever write the inode itself via
inode_wait_for_writeback(), because of how ext4 journalling works.  (See
the comments before ext4_mark_inode_dirty()).  And for the special
inodes (directories, device nodes, etc.) there's no data dirtiness to
worry about.  For regular files, we hit this code path when we have just
created the inode, but were not able to add a link to the parent
directory; the fd hasn't been released to userspace yet, so it can't
be data dirty either.

So unless I'm missing something, I don't think the deadlock described
above is possible?

We can certainly add it to the orphan list if it's necessary, but it's
extra overhead and adds a global contention point.  So if it's not
necessary, I'd rather avoid it if possible, and I think it's safe to
do so in this case.

						- Ted

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-19 19:19 ` [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
@ 2019-10-24 13:09   ` Jan Kara
  2019-10-24 15:12     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-24 13:09 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Sat 19-10-19 15:19:33, Theodore Y. Ts'o wrote:
> Hi Jan,
> 
> I've tried applying this patch set against 5.4-rc3, and I'm finding an
> easily reproducible failure using:
> 
> 	kvm-xfstests -c ext3conv ext4/039
> 
> It is the BUG_ON in fs/jbd2/commit.c, around line 570:
> 
> 	J_ASSERT(commit_transaction->t_nr_buffers <=
> 		 atomic_read(&commit_transaction->t_outstanding_credits));
> 
> The failure (with the obvious debugging printk added) is:
> 
> ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
> [    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
> [    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
> [    8.229215] ------------[ cut here ]------------
> [    8.230249] kernel BUG at fs/jbd2/commit.c:573!
>      	       ...
> 
> The full log is attached (although the stack trace isn't terribly
> interesting, since this is being run out of kjournald2).

Thanks! Somehow this escaped my testing although I thought I had run the ext3
configuration... Anyway, we are reserving too little space in this case - with
some debugging added:

[   80.296029] t_buffers: 222, t_outstanding_credits: 219,
t_revoke_written: 23, t_revoke_reserved: 12, t_revoke_records_written
11432, t_revoke_records_reserved 11432, revokes_per_block: 1020

Which is really puzzling because it would suggest that revokes_per_block is
actually wrong. Digging more into this.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-24 12:09       ` Theodore Y. Ts'o
@ 2019-10-24 13:37         ` Jan Kara
  2019-11-04 12:35           ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-10-24 13:37 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4, stable

On Thu 24-10-19 08:09:58, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 12:19:06PM +0200, Jan Kara wrote:
> > Correct on both points. Thanks for spotting this! Now I still don't think
> > that calling iput() with running transaction is good. It complicates
> > matters with revoke record reservation but it is also fragile for other
> > reasons - e.g. flush worker could find the allocated inode just before we
> > will call iput() on it, try to write it out, block on starting transaction
> > and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
> > inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
> > or similar, that's why I say above it's just fragile, not an outright bug.
> 
> But we don't ever write the inode itself via
> inode_wait_for_writeback(), because of how ext4 journalling works.  (See
> the comments before ext4_mark_inode_dirty()).  And for the special
> inodes (directories, device nodes, etc.) there's no data dirtiness to
> worry about.  For regular files, we hit this code path when we have just
> created the inode, but were not able to add a link to the parent
> directory; the fd hasn't been released to userspace yet, so it can't
> be data dirty either.
> 
> So unless I'm missing something, I don't think the deadlock described
> above is possible?

Actually, now that I look at it, large symlinks may be prone to this
deadlock. There we create unlinked inode, add it to orphan list, stop
transaction, call __page_symlink() which will dirty the inode through
mark_inode_dirty(), then we start transaction and call ext4_add_nondir()
which may call iput() while the transaction is started.

Granted we can fix just ext4_symlink() but it kind of demonstrates my point
that calling iput() under transaction is fragile - some of the stuff done
on last iput generally ranks above transaction start; it's just that in the
cases where we clean up a failed create, none of it happens to block
currently (except for the symlink case mentioned above). Also, lockdep does
not track dependencies like inode_wait_for_writeback(), otherwise it would
complain as well.

> We can certainly add it to the orphan list if it's necessary, but it's
> extra overhead and adds a global contention point.  So if it's not
> necessary, I'd rather avoid it if possible, and I think it's safe to
> do so in this case.

As this is error cleanup path (only EIO and ENOSPC are realistic failure
cases AFAICT) I don't think performance really matters here. I certainly
don't want to add inode to orphan list in the fast path. I agree that would
be non-starter. I'll try to write a patch and we'll see how bad it will be.
If you still hate it, I can have a look into how bad it would be to fix
ext4_symlink() and somehow deal with revoke reservation issues.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-24 13:09   ` Jan Kara
@ 2019-10-24 15:12     ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-10-24 15:12 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Thu 24-10-19 15:09:08, Jan Kara wrote:
> On Sat 19-10-19 15:19:33, Theodore Y. Ts'o wrote:
> > Hi Jan,
> > 
> > I've tried applying this patch set against 5.4-rc3, and I'm finding an
> > easily reproducible failure using:
> > 
> > 	kvm-xfstests -c ext3conv ext4/039
> > 
> > It is the BUG_ON in fs/jbd2/commit.c, around line 570:
> > 
> > 	J_ASSERT(commit_transaction->t_nr_buffers <=
> > 		 atomic_read(&commit_transaction->t_outstanding_credits));
> > 
> > The failure (with the obvious debugging printk added) is:
> > 
> > ext4/039		[15:13:16][    6.747101] run fstests ext4/039 at 2019-10-19 15:13:16
> > [    7.018766] Mounted ext4 file system at /vdc supports timestamps until 2038 (0x7fffffff)
> > [    8.227631] JBD2: t_nr_buffers 226, t_outstanding_credits=223
> > [    8.229215] ------------[ cut here ]------------
> > [    8.230249] kernel BUG at fs/jbd2/commit.c:573!
> >      	       ...
> > 
> > The full log is attached (although the stack trace isn't terribly
> > interesting, since this is being run out of kjournald2).
> 
> Thanks! Somehow this escaped my testing although I thought I had run the ext3
> configuration... Anyway, we are reserving too little space in this case - with
> some debugging added:
> 
> [   80.296029] t_buffers: 222, t_outstanding_credits: 219,
> t_revoke_written: 23, t_revoke_reserved: 12, t_revoke_records_written
> 11432, t_revoke_records_reserved 11432, revokes_per_block: 1020
> 
> Which is really puzzling because it would suggest that revokes_per_block is
> actually wrong. Digging more into this.

Yeah, ext4 was updating journal features in this case but
j_revoke_records_per_block didn't get properly updated. Fixed now.
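
The debug line quoted above is consistent with a stale per-block count. Below
is a minimal sketch of that arithmetic, assuming the number of reserved
descriptor blocks is just the record count divided by the records-per-block
value, rounded up. The helper name is hypothetical, and 510 is an assumed
post-update value (4k block, 8-byte records, no checksum tail); the thread
does not say which journal feature actually changed.

```c
#include <assert.h>

/*
 * Hypothetical helper: number of descriptor blocks needed to hold
 * `records` revoke records when each block fits `per_block` of them
 * (ceiling division).
 */
static unsigned int revoke_descriptor_blocks(unsigned int records,
					     unsigned int per_block)
{
	assert(per_block > 0);
	return (records + per_block - 1) / per_block;
}
```

With the 11432 records from the log, a per-block value of 1020 yields 12
blocks (matching t_revoke_reserved) while 510 yields 23 (matching
t_revoke_written), which would account for the observed shortfall.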

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (22 preceding siblings ...)
  2019-10-19 19:19 ` [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
@ 2019-11-04  3:32 ` Theodore Y. Ts'o
  2019-11-04 11:22   ` Jan Kara
  2019-11-05 16:44 ` [PATCH 0/25 " Jan Kara
                   ` (27 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04  3:32 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

Hi Jan,

I believe that I'm waiting for the v4 version of this series with some
pending fixes that you are planning on making.  Is that correct?

Thanks!!

					- Ted

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-11-04  3:32 ` Theodore Y. Ts'o
@ 2019-11-04 11:22   ` Jan Kara
  2019-11-04 13:09     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-11-04 11:22 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

Hi Ted!

On Sun 03-11-19 22:32:52, Theodore Y. Ts'o wrote:
> I believe that I'm waiting for the v4 version of this series with some
> pending fixes that you are planning on making.  Is that correct?

Ah, good that you pinged me because I have the series ready but I was
waiting for your answers to some explanations... In particular discussion
around patch 3 (move iput() outside of transaction), patch 15 (dropping of
j_state_lock around t_tid load), patch 18 (possible large overreservation
of descriptor blocks due to rounding), and 21 (change of on-disk format for
revoke descriptors).

Out of these I probably find the overreservation due to rounding the most
serious and easy enough to handle so I'll fix that and then resend the
series unless you raise your objection also in some other case.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()
  2019-10-24 13:37         ` Jan Kara
@ 2019-11-04 12:35           ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04 12:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Thu, Oct 24, 2019 at 03:37:01PM +0200, Jan Kara wrote:
> > We can certainly add it to the orphan list if it's necessary, but it's
> > extra overhead and adds a global contention point.  So if it's not
> > necessary, I'd rather avoid it if possible, and I think it's safe to
> > do so in this case.
> 
> As this is error cleanup path (only EIO and ENOSPC are realistic failure
> cases AFAICT) I don't think performance really matters here. I certainly
> don't want to add inode to orphan list in the fast path. I agree that would
> be non-starter. I'll try to write a patch and we'll see how bad it will be.
> If you still hate it, I can have a look into how bad it would be to fix
> ext4_symlink() and somehow deal with revoke reservation issues.

That seems fair; I agree, adding inodes to the list in the error path
shouldn't be an issue.

					- Ted

* Re: [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle
  2019-10-23 16:17     ` Jan Kara
@ 2019-11-04 12:36       ` Theodore Y. Ts'o
  2019-11-04 12:59         ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04 12:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Wed, Oct 23, 2019 at 06:17:24PM +0200, Jan Kara wrote:
> > What is j_state_lock protecting at this point?  There's only a 32-bit
> > read of j_commit_request at this point.
> 
> We could almost drop the lock. To be fully correct, we'd then need to use
> READ_ONCE here and WRITE_ONCE in places changing j_commit_request (reasons
> are well summarized in recent LWN series on how compiler can screw your
> unlocked reads and writes). So probably a fair cleanup but something I've
> decided to leave for later.

Fair enough; maybe leave a quick TODO comment so we remember that this
is an outstanding clean up?

						- Ted

* Re: [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle
  2019-11-04 12:36       ` Theodore Y. Ts'o
@ 2019-11-04 12:59         ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-04 12:59 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 04-11-19 07:36:50, Theodore Y. Ts'o wrote:
> On Wed, Oct 23, 2019 at 06:17:24PM +0200, Jan Kara wrote:
> > > What is j_state_lock protecting at this point?  There's only a 32-bit
> > > read of j_commit_request at this point.
> > 
> > We could almost drop the lock. To be fully correct, we'd then need to use
> > READ_ONCE here and WRITE_ONCE in places changing j_commit_request (reasons
> > are well summarized in recent LWN series on how compiler can screw your
> > unlocked reads and writes). So probably a fair cleanup but something I've
> > decided to leave for later.
> 
> Fair enough; maybe leave a quick TODO comment so we remember that this
> is an outstanding clean up?

Good idea. Added.
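
For reference, the deferred cleanup could look roughly like the userspace
sketch below. The two macros are simplified stand-ins for the kernel's
READ_ONCE/WRITE_ONCE (the real definitions are more elaborate), and the
struct and function names are hypothetical; only the j_commit_request field
name comes from the discussion above.

```c
#include <assert.h>

/* Simplified stand-ins for the kernel macros, using GNU C __typeof__. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical cut-down journal holding only the field in question. */
struct journal_sketch {
	unsigned int j_commit_request;
};

/* Writer side: every store to the field goes through WRITE_ONCE. */
static void request_commit(struct journal_sketch *j, unsigned int tid)
{
	assert(j);
	WRITE_ONCE(j->j_commit_request, tid);
}

/* Reader side: a single lockless load that the compiler may not tear,
 * repeat, or fuse with other loads. */
static int commit_requested(struct journal_sketch *j, unsigned int tid)
{
	assert(j);
	return READ_ONCE(j->j_commit_request) == tid;
}
```

The point of pairing the two is that dropping the lock is only safe if all
writers are converted as well, which is why this is more than a one-line
change.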

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* Re: [PATCH 21/22] ext4: Reserve revoke credits for freed blocks
  2019-10-23 16:13     ` Jan Kara
@ 2019-11-04 13:08       ` Theodore Y. Ts'o
  2019-11-05  8:31         ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04 13:08 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Wed, Oct 23, 2019 at 06:13:14PM +0200, Jan Kara wrote:
> > It would probably be better to push this up to the callers, since we
> > can get the exact number by calculating
> > 
> > 	(EXT4_B2C(sbi, last) - EXT4_B2C(sbi, first) + 1) * sbi->s_cluster_ratio
> > 
> > This is a bit more complicated in fs/ext4/indirect.c, where we
> > probably will need to do a min of the these two formulas.
> 
> Is it worth the complexity at the callers? If we don't use some reserved
> revoke credits, we'll just return them back. And the truncate code
> generally works one extent at a time so in the end we may have just asked
> for 1 more descriptor block than strictly necessary while the handle is
> running...

Sure, this is a change we can make later if we think it's necessary.
Bigalloc file systems aren't that common, and when they are used, most
of the time people aren't creating large numbers of small files and/or
directories.
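
For reference, the exact count from the formula quoted above can be sketched
in userspace as follows. This assumes EXT4_B2C() is a right shift by the
log2 of the cluster ratio; both function names here are hypothetical
stand-ins, not ext4 code.

```c
#include <assert.h>

/* Hypothetical stand-in for EXT4_B2C(): block number to cluster number,
 * assuming the cluster ratio is 1 << cluster_bits. */
static unsigned long long b2c(unsigned long long blk,
			      unsigned int cluster_bits)
{
	return blk >> cluster_bits;
}

/* Number of blocks covered by every cluster touched by [first, last]. */
static unsigned long long blocks_in_touched_clusters(unsigned long long first,
						     unsigned long long last,
						     unsigned int cluster_bits)
{
	assert(first <= last);
	return (b2c(last, cluster_bits) - b2c(first, cluster_bits) + 1)
		<< cluster_bits;
}
```

For example, with a cluster ratio of 16 (cluster_bits = 4), freeing blocks
5..18 touches clusters 0 and 1, so the bound is 32 blocks.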

> Yes, I was thinking about the same. Extent format of revoke blocks would
> certainly reduce the number of revoke descriptor blocks in the average
> case. On the other hand I think that especially large directories can be
> pretty fragmented so it isn't clear how big the average win would be. And
> as you say the worst case estimate would not really change substantially
> with the different format so to make the filesystem resistant to a malicious
> attacker we need some form of reservation of revoke descriptor blocks
> anyway. So in the end I've decided to go without on-disk format change for
> now.

Adding a new on-disk journal format is easier than making other ext4
format changes, since the journal is transient, and the case where the
user is simultaneously (a) rolling back to an older kernel which might
not support the new journal feature, and (b) crashes so that journal
replay is necessary, and (c) it's the root file system, so e2fsck
can't take care of the journal replay is a pretty rare / edge case.

That being said, we can set that aside as a possible later
enhancement.  I suspect the main place we would have a large
contiguous range of blocks to be revoked is the data=journal case, and
one of the things I keep wondering about is how much it is worth it to
keep that code.  So long as it's not posing a code maintenance burden,
I don't mind that much; but I also wonder how many people are actually
using it in practice.

Out of curiosity, how easily were you able to trigger the revoke
overflow situation using normal directories?  I would have expected it
would have been fairly difficult to do, except for large file
deletions with data=journal?

						- Ted

* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-11-04 11:22   ` Jan Kara
@ 2019-11-04 13:09     ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-04 13:09 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Mon, Nov 04, 2019 at 12:22:32PM +0100, Jan Kara wrote:
> Hi Ted!
> 
> On Sun 03-11-19 22:32:52, Theodore Y. Ts'o wrote:
> > I believe that I'm waiting for the v4 version of this series with some
> > pending fixes that you are planning on making.  Is that correct?
> 
> Ah, good that you pinged me because I have the series ready but I was
> waiting for your answers to some explanations... In particular discussion
> around patch 3 (move iput() outside of transaction), patch 15 (dropping of
> j_state_lock around t_tid load), patch 18 (possible large overreservation
> of descriptor blocks due to rounding), and 21 (change of on-disk format for
> revoke descriptors).

Sorry, I didn't comment because I accepted your arguments; but I guess
I should have said so explicitly.  I just replied to those threads.

> Out of these I probably find the overreservation due to rounding the most
> serious and easy enough to handle so I'll fix that and then resend the
> series unless you raise your objection also in some other case.

Great!

					- Ted

* Re: [PATCH 21/22] ext4: Reserve revoke credits for freed blocks
  2019-11-04 13:08       ` Theodore Y. Ts'o
@ 2019-11-05  8:31         ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05  8:31 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Mon 04-11-19 08:08:23, Theodore Y. Ts'o wrote:
> On Wed, Oct 23, 2019 at 06:13:14PM +0200, Jan Kara wrote:
> > Yes, I was thinking about the same. Extent format of revoke blocks would
> > certainly reduce the number of revoke descriptor blocks in the average
> > case. On the other hand I think that especially large directories can be
> > pretty fragmented so it isn't clear how big the average win would be. And
> > as you say the worst case estimate would not really change substantially
> > with the different format so to make the filesystem resistant to a malicious
> > attacker we need some form of reservation of revoke descriptor blocks
> > anyway. So in the end I've decided to go without on-disk format change for
> > now.
> 
> Adding a new on-disk journal format is easier than making other ext4
> format changes, since the journal is transient, and the case where the
> user is simultaneously (a) rolling back to an older kernel which might
> not support the new journal feature, and (b) crashes so that journal
> replay is necessary, and (c) it's the root file system, so e2fsck
> can't take care of the journal replay is a pretty rare / edge case.

Yeah, agreed.

> That being said, we can set that aside as a possible later
> enhancement.  I suspect the main place we would have a large
> contiguous range of blocks to be revoked is the data=journal case, and
> one of the things I keep wondering about is how much it is worth it to
> keep that code.  So long as it's not posing a code maintenance burden,
> I don't mind that much; but I also wonder how many people are actually
> using it in practice.

Agreed as well. From time to time I spot some data=journal users but they
are really rare.

> Out of curiosity, how easily were you able to trigger the revoke
> overflow situation using normal directories?  I would have expected it
> would have been fairly difficult to do, except for large file
> deletions with data=journal?

Yes, triggering the situation with normal directories is not easy. Although
with 1k blocksize, already deleting 32MB worth of directories, which isn't
that huge, overruns the default reserve we have for descriptor blocks by
quite a bit. Now if that happens in a situation where the transaction was
just about to fit in the free space in the journal, you get the assertion
failure. I didn't try to reproduce this since triggering the assertion with
data=journal is much faster and easier (so that's what I used for testing)
but our customer accidentally hit this so it's certainly possible (although
he was the first one in all those years).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

* [PATCH 0/25 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (23 preceding siblings ...)
  2019-11-04  3:32 ` Theodore Y. Ts'o
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 01/25] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
                   ` (26 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Hello,

Here is v4 of this series with a couple more bugs fixed. Now all failed tests
Ted highlighted pass for me.

Changes since v3:
* Added reviewed-by tags
* Added update of precomputed number of revoke records per descriptor block
  when journal features change
* Fix possible inode leak in ext4_add_mkdir() in case of crash
* Fix other cases where last inode reference gets dropped while the transaction
  is running
* Improve comment about the new meaning of t_outstanding_credits
* Added patch to reduce estimate on the base number of reserved descriptor
  blocks
* jbd2_journal_revoke() now returns error in case there are not enough revoke
  credits so that ext4 can abort handle etc.
* Improve estimate on the number of necessary revoke records for truncate
* Add fix for too small array of journal buffers
* Account number of revoke records in the transaction to avoid overestimation
  of necessary revoke descriptor blocks

Changes since v2:
* Fixed bug in revoke credit estimates for extent freeing in bigalloc
  filesystems
* Fixed bug in xattr code treating positive return of
  ext4_journal_ensure_credits() as error
* Fixed preexisting bug in ext4_evict_inode() where we reserved too few credits
* Added trace point to jbd2_journal_restart()
* Fix some kernel doc bugs
* Rebased on top of 5.4-rc1

Changes since v1:
* Reordered some patches to reduce code churn
* Computation in jbd2_revoke_descriptors_per_block() was too early - moved it
  to later when journal superblock is loaded and so the feature checking
  actually works.
* Made sure nobody outside JBD2 uses handle->h_buffer_credits since now it
  also contains credits for revoke descriptors and it was confusing some users.
* Updated cover letter with more details about reproducer

Original cover letter:

I've recently got a bug report where JBD2 assertion failed due to
transaction commit running out of journal space. After closer inspection of
the crash dump it seems that the problem is that there were too many
journal descriptor blocks (more than the max_transaction_size >> 5 + 32 we
estimate in jbd2_log_space_left()) due to descriptor blocks with revoke
records. In fact the estimate on the number of descriptor blocks looks
pretty arbitrary and there can be many more descriptor blocks needed for
revoke records. We need one revoke record for every metadata block freed.
So in the worst case (1k blocksize, 64-bit journal feature enabled,
checksumming enabled) we fit 125 revoke records in one descriptor block.  In
common cases it's about 500 revoke records per descriptor block. Now when
we free large directories or large file with data journalling enabled, we can
have *lots* of blocks to revoke - with extent mapped files easily millions
in a single transaction which can mean 10k descriptor blocks - clearly more
than the estimate of 128 descriptor blocks per transaction ;)
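
The per-block figures above can be reproduced with a small userspace sketch.
The header and tail sizes below are assumptions taken from the jbd2 on-disk
format rather than from this email, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Sizes assumed from the jbd2 on-disk format, not stated in this email. */
#define REVOKE_HEADER_SIZE	16	/* 12-byte journal header + 4-byte count */
#define BLOCK_TAIL_SIZE		4	/* checksum tail, only with checksumming */

/* How many revoke records fit into one revoke descriptor block. */
static int revoke_records_per_block(int blocksize, bool has_64bit,
				    bool has_csum)
{
	int space = blocksize - REVOKE_HEADER_SIZE;
	int record_size = has_64bit ? 8 : 4;	/* one block number per record */

	assert(blocksize >= 1024);
	if (has_csum)
		space -= BLOCK_TAIL_SIZE;
	return space / record_size;
}
```

Under these assumptions a 1k block with 64-bit and checksumming gives 125
records, and a 4k block with the same features gives 509. At 125 records per
block, one million revoked blocks need 8000 descriptor blocks, the same
order of magnitude as the 10k figure above.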

This patch series aims at fixing the problem by accounting descriptor blocks
into transaction credits and reserving appropriate amount of credits for revoke
descriptors on transaction handle start. Similar to normal transaction credits,
the filesystem has to provide estimate for the number of blocks it is going
to revoke using the transaction handle so that credits for revoke descriptors
can be reserved.

The series has survived fstests in a couple of configurations and also a
stress test like:
  Create filesystem with 1KB blocksize and journal size 32MB
  Mount the filesystem with -o nodelalloc
  for (( i = 0; i < 4; i++ )); do
    dd if=/dev/zero of=file$i bs=1M count=2048 conv=fsync
    chattr +j file$i
  done
  for (( i = 0; i < 4; i++ )); do
    rm file$i&
  done

which reliably triggers the assertion failure in JBD2 on unpatched kernel.

Review and comments are welcome :).

								Honza
Previous versions:
Link: http://lore.kernel.org/r/20190927111536.16455-1-jack@suse.cz
Link: http://lore.kernel.org/r/20190930103544.11479-1-jack@suse.cz

* [PATCH 01/25] jbd2: Fix possible overflow in jbd2_log_space_left()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (24 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 0/25 " Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 02/25] jbd2: Fixup stale comment in commit code Jan Kara
                   ` (25 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara, stable

When the amount of free space in the journal is very low, the arithmetic in
jbd2_log_space_left() could underflow, resulting in a very high number of
free blocks and thus triggering an assertion failure in the transaction
commit code complaining there's not enough space in the journal:

J_ASSERT(journal->j_free > 1);

Properly check for the low number of free blocks.

CC: stable@vger.kernel.org
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/jbd2.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 603fbc4e2f70..10e6049c0ba9 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1582,7 +1582,7 @@ static inline int jbd2_space_needed(journal_t *journal)
 static inline unsigned long jbd2_log_space_left(journal_t *journal)
 {
 	/* Allow for rounding errors */
-	unsigned long free = journal->j_free - 32;
+	long free = journal->j_free - 32;
 
 	if (journal->j_committing_transaction) {
 		unsigned long committing = atomic_read(&journal->
@@ -1591,7 +1591,7 @@ static inline unsigned long jbd2_log_space_left(journal_t *journal)
 		/* Transaction + control blocks */
 		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
 	}
-	return free;
+	return max_t(long, free, 0);
 }
 
 /*
-- 
2.16.4
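
The effect of the change can be modelled in userspace as below. This is only
a sketch: the function names are hypothetical, the journal fields are passed
in as plain arguments, and the shift value 5 is assumed from the ">> 5"
mentioned in the cover letter.

```c
#include <assert.h>

#define CONTROL_BLOCKS_SHIFT 5	/* assumed value of JBD2_CONTROL_BLOCKS_SHIFT */

/* Model of the old code: unsigned arithmetic wraps around below zero. */
static unsigned long space_left_old(unsigned long j_free,
				    unsigned long committing)
{
	unsigned long free = j_free - 32;

	free -= committing + (committing >> CONTROL_BLOCKS_SHIFT);
	return free;
}

/* Model of the patched code: signed arithmetic, clamped at zero. */
static unsigned long space_left_new(unsigned long j_free,
				    unsigned long committing)
{
	long free = (long)j_free - 32;

	free -= (long)(committing + (committing >> CONTROL_BLOCKS_SHIFT));
	return free > 0 ? (unsigned long)free : 0;
}
```

With j_free = 60 and 64 outstanding credits in the committing transaction,
the old model returns a huge wrapped value while the new one returns 0; with
j_free = 100 both return 2.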


* [PATCH 02/25] jbd2: Fixup stale comment in commit code
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (25 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 01/25] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 03/25] jbd2: Completely fill journal descriptor blocks Jan Kara
                   ` (24 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

jbd2_journal_next_log_block() does not look at
transaction->t_outstanding_credits. Remove the misleading comment.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 132fb92098c7..c6d39f2ad828 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -642,8 +642,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 		/*
 		 * start_this_handle() uses t_outstanding_credits to determine
-		 * the free space in the log, but this counter is changed
-		 * by jbd2_journal_next_log_block() also.
+		 * the free space in the log.
 		 */
 		atomic_dec(&commit_transaction->t_outstanding_credits);
 
-- 
2.16.4


* [PATCH 03/25] jbd2: Completely fill journal descriptor blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (26 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 02/25] jbd2: Fixup stale comment in commit code Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 04/25] ext4: Move marking of handle as sync to ext4_add_nondir() Jan Kara
                   ` (23 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

With 32-bit block numbers, we don't allocate the array for journal
buffer heads large enough for corresponding descriptor tags to fill the
descriptor block. Thus we end up writing out half-full descriptor blocks
to the journal unnecessarily growing the transaction. Fix the logic to
allocate the array large enough.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/journal.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 1c58859aa592..cc11097f1176 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1098,6 +1098,16 @@ static void jbd2_stats_proc_exit(journal_t *journal)
 	remove_proc_entry(journal->j_devname, proc_jbd2_stats);
 }
 
+/* Minimum size of descriptor tag */
+static int jbd2_min_tag_size(void)
+{
+	/*
+	 * Tag with 32-bit block numbers does not use last four bytes of the
+	 * structure
+	 */
+	return sizeof(journal_block_tag_t) - 4;
+}
+
 /*
  * Management for journal control blocks: functions to create and
  * destroy journal_t structures, and to initialise and read existing
@@ -1156,7 +1166,8 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	journal->j_fs_dev = fs_dev;
 	journal->j_blk_offset = start;
 	journal->j_maxlen = len;
-	n = journal->j_blocksize / sizeof(journal_block_tag_t);
+	/* We need enough buffers to write out full descriptor block. */
+	n = journal->j_blocksize / jbd2_min_tag_size();
 	journal->j_wbufsize = n;
 	journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
 					GFP_KERNEL);
-- 
2.16.4
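
To put numbers on the change, here is a userspace sketch of the array sizing
before and after. The struct is a local mirror of the tag layout and is an
assumption of this sketch (as are the function names); it is not copied from
this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the descriptor tag layout (assumed, 12 bytes). */
struct block_tag_sketch {
	uint32_t t_blocknr;
	uint16_t t_checksum;
	uint16_t t_flags;
	uint32_t t_blocknr_high;	/* unused with 32-bit block numbers */
};

/* Old sizing: divide by the full structure size. */
static size_t wbuf_entries_old(size_t blocksize)
{
	return blocksize / sizeof(struct block_tag_sketch);
}

/* New sizing: divide by the smallest tag that can appear on disk. */
static size_t wbuf_entries_new(size_t blocksize)
{
	return blocksize / (sizeof(struct block_tag_sketch) - 4);
}
```

For a 4096-byte block this gives 341 entries before and 512 after. The
result is only an upper bound on the array size; a real descriptor block
also carries a header and other per-block data, so it holds somewhat fewer
tags.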


* [PATCH 04/25] ext4: Move marking of handle as sync to ext4_add_nondir()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (27 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 03/25] jbd2: Completely fill journal descriptor blocks Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 05/25] ext4: Do not iput inode under running transaction Jan Kara
                   ` (22 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Every caller of ext4_add_nondir() marks the handle as sync if the directory
has DIRSYNC set. Move this marking to ext4_add_nondir() to reduce some
duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a427d2031a8d..97cf1c8b56b2 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2550,9 +2550,12 @@ static void ext4_dec_count(handle_t *handle, struct inode *inode)
 static int ext4_add_nondir(handle_t *handle,
 		struct dentry *dentry, struct inode *inode)
 {
+	struct inode *dir = d_inode(dentry->d_parent);
 	int err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
 		ext4_mark_inode_dirty(handle, inode);
+		if (IS_DIRSYNC(dir))
+			ext4_handle_sync(handle);
 		d_instantiate_new(dentry, inode);
 		return 0;
 	}
@@ -2593,8 +2596,6 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
 		err = ext4_add_nondir(handle, dentry, inode);
-		if (!err && IS_DIRSYNC(dir))
-			ext4_handle_sync(handle);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -2625,8 +2626,6 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 		init_special_inode(inode, inode->i_mode, rdev);
 		inode->i_op = &ext4_special_inode_operations;
 		err = ext4_add_nondir(handle, dentry, inode);
-		if (!err && IS_DIRSYNC(dir))
-			ext4_handle_sync(handle);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -3329,9 +3328,6 @@ static int ext4_symlink(struct inode *dir,
 	}
 	EXT4_I(inode)->i_disksize = inode->i_size;
 	err = ext4_add_nondir(handle, dentry, inode);
-	if (!err && IS_DIRSYNC(dir))
-		ext4_handle_sync(handle);
-
 	if (handle)
 		ext4_journal_stop(handle);
 	goto out_free_encrypted_link;
-- 
2.16.4



* [PATCH 05/25] ext4: Do not iput inode under running transaction
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (28 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 04/25] ext4: Move marking of handle as sync to ext4_add_nondir() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 06/25] ext4: Fix credit estimate for final inode freeing Jan Kara
                   ` (21 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara, stable

When ext4_mkdir(), ext4_symlink(), ext4_create(), or ext4_mknod() fails
to add an entry into the directory, it ends up dropping the freshly
created inode under the running transaction, and thus inode truncation
happens under that transaction. That breaks the assumption that evict()
does not get called from a transaction context, and at least in the
ext4_symlink() case it can result in inode eviction deadlocking in
inode_wait_for_writeback() when the flush worker finds the symlink
inode, starts to write it back, and blocks on starting a transaction.
So change the code in ext4_mkdir() and ext4_add_nondir() to drop the
inode reference only after the transaction is stopped. We also have to
add the inode to the orphan list in that case, as otherwise the inode
would get leaked if we crash before the inode deletion is committed.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)
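
A minimal user-space sketch of the reference hand-off described above.
Everything in it is a hypothetical stand-in (struct mock_inode,
struct mock_handle, mock_iput(), mock_add_nondir(), mock_create()); it
is not ext4 code and only illustrates the ordering: the callee clears
the pointer when the reference is consumed, and the caller drops any
remaining reference only after the transaction has been stopped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not real ext4/jbd2 types. */
struct mock_inode {
	int refcount;
	bool evicted_in_txn;	/* set if the last put ran inside a transaction */
};

struct mock_handle {
	bool running;
};

/* Drop one reference and record the context of the final put. */
static void mock_iput(struct mock_inode *inode, const struct mock_handle *h)
{
	if (--inode->refcount == 0)
		inode->evicted_in_txn = h->running;
}

/*
 * On success the reference is consumed and *inodep is cleared.
 * On failure the reference stays with the caller.
 */
static int mock_add_nondir(struct mock_inode **inodep, int add_entry_err)
{
	if (!add_entry_err) {
		*inodep = NULL;	/* the dentry now owns the reference */
		return 0;
	}
	return add_entry_err;
}

/* Caller pattern: stop the transaction first, then drop the reference. */
static int mock_create(struct mock_handle *h, struct mock_inode *inode,
		       int add_entry_err)
{
	int err;

	h->running = true;
	err = mock_add_nondir(&inode, add_entry_err);
	h->running = false;	/* analogue of ext4_journal_stop() */
	if (inode)
		mock_iput(inode, h);	/* no transaction is running here */
	return err;
}
```

In this model a failed mock_create() releases the inode with
evicted_in_txn left false, which is the property the patch is after.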

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 97cf1c8b56b2..a67cae3c8ff5 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2547,21 +2547,29 @@ static void ext4_dec_count(handle_t *handle, struct inode *inode)
 }
 
 
+/*
+ * Add non-directory inode to a directory. On success, the inode reference is
+ * consumed by dentry instantiation. This is also indicated by clearing of
+ * *inodep pointer. On failure, the caller is responsible for dropping the
+ * inode reference in the safe context.
+ */
 static int ext4_add_nondir(handle_t *handle,
-		struct dentry *dentry, struct inode *inode)
+		struct dentry *dentry, struct inode **inodep)
 {
 	struct inode *dir = d_inode(dentry->d_parent);
+	struct inode *inode = *inodep;
 	int err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
 		ext4_mark_inode_dirty(handle, inode);
 		if (IS_DIRSYNC(dir))
 			ext4_handle_sync(handle);
 		d_instantiate_new(dentry, inode);
+		*inodep = NULL;
 		return 0;
 	}
 	drop_nlink(inode);
+	ext4_orphan_add(handle, inode);
 	unlock_new_inode(inode);
-	iput(inode);
 	return err;
 }
 
@@ -2595,10 +2603,12 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
-		err = ext4_add_nondir(handle, dentry, inode);
+		err = ext4_add_nondir(handle, dentry, &inode);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
+	if (!IS_ERR_OR_NULL(inode))
+		iput(inode);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
 	return err;
@@ -2625,10 +2635,12 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 	if (!IS_ERR(inode)) {
 		init_special_inode(inode, inode->i_mode, rdev);
 		inode->i_op = &ext4_special_inode_operations;
-		err = ext4_add_nondir(handle, dentry, inode);
+		err = ext4_add_nondir(handle, dentry, &inode);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
+	if (!IS_ERR_OR_NULL(inode))
+		iput(inode);
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
 	return err;
@@ -2778,10 +2790,12 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	if (err) {
 out_clear_inode:
 		clear_nlink(inode);
+		ext4_orphan_add(handle, inode);
 		unlock_new_inode(inode);
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_journal_stop(handle);
 		iput(inode);
-		goto out_stop;
+		goto out_retry;
 	}
 	ext4_inc_count(handle, dir);
 	ext4_update_dx_flag(dir);
@@ -2795,6 +2809,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 out_stop:
 	if (handle)
 		ext4_journal_stop(handle);
+out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
 	return err;
@@ -3327,9 +3342,11 @@ static int ext4_symlink(struct inode *dir,
 		inode->i_size = disk_link.len - 1;
 	}
 	EXT4_I(inode)->i_disksize = inode->i_size;
-	err = ext4_add_nondir(handle, dentry, inode);
+	err = ext4_add_nondir(handle, dentry, &inode);
 	if (handle)
 		ext4_journal_stop(handle);
+	if (inode)
+		iput(inode);
 	goto out_free_encrypted_link;
 
 err_drop_inode:
-- 
2.16.4



* [PATCH 06/25] ext4: Fix credit estimate for final inode freeing
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (29 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 05/25] ext4: Do not iput inode under running transaction Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 21:00   ` Theodore Y. Ts'o
  2019-11-05 16:44 ` [PATCH 07/25] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
                   ` (20 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara, stable

The estimate for the number of credits needed for final freeing of an
inode in ext4_evict_inode() was too small. We may modify 4 blocks (inode
& sb for orphan deletion, bitmap & group descriptor for inode freeing)
and not just 3.

Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
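
The credit arithmetic in the diff below can be sketched as plain C. The
counts are copied from the comments in the patch; the enum names and
evict_handle_credits() are illustrative only, and the two parameters
stand in for ext4_blocks_for_truncate() and EXT4_MAXQUOTAS_DEL_BLOCKS().

```c
/* Credit counts copied from the comments in this patch; names are illustrative. */
enum {
	ORPHAN_DEL_CREDITS = 2,	/* sb + inode (ext4_orphan_del()) */
	XATTR_FREE_CREDITS = 2,	/* block bitmap + group descriptor */
	INODE_FREE_CREDITS = 2,	/* bitmap + group descriptor */
	TRUNCATE_OVERLAP   = 3,	/* block bitmap, group descriptor, inode */
};

/*
 * truncate_blocks stands in for ext4_blocks_for_truncate(inode);
 * quota_blocks stands in for EXT4_MAXQUOTAS_DEL_BLOCKS(sb), or 0 for
 * inodes without quota accounting.
 */
static int evict_handle_credits(int truncate_blocks, int quota_blocks)
{
	int extra_credits = ORPHAN_DEL_CREDITS + XATTR_FREE_CREDITS +
			    INODE_FREE_CREDITS;

	extra_credits += quota_blocks;
	/* Blocks counted on both sides are subtracted once. */
	return truncate_blocks + extra_credits - TRUNCATE_OVERLAP;
}
```

With no quota blocks, the handle is started with the truncate estimate
plus 3 in this model.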

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 516faa280ced..81bc2fb23c40 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -196,7 +196,12 @@ void ext4_evict_inode(struct inode *inode)
 {
 	handle_t *handle;
 	int err;
-	int extra_credits = 3;
+	/*
+	 * Credits for final inode cleanup and freeing:
+	 * sb + inode (ext4_orphan_del()), block bitmap, group descriptor
+	 * (xattr block freeing), bitmap, group descriptor (inode freeing)
+	 */
+	int extra_credits = 6;
 	struct ext4_xattr_inode_array *ea_inode_array = NULL;
 
 	trace_ext4_evict_inode(inode);
@@ -252,8 +257,12 @@ void ext4_evict_inode(struct inode *inode)
 	if (!IS_NOQUOTA(inode))
 		extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
 
+	/*
+	 * Block bitmap, group descriptor, and inode are accounted in both
+	 * ext4_blocks_for_truncate() and extra_credits. So subtract 3.
+	 */
 	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE,
-				 ext4_blocks_for_truncate(inode)+extra_credits);
+			 ext4_blocks_for_truncate(inode) + extra_credits - 3);
 	if (IS_ERR(handle)) {
 		ext4_std_error(inode->i_sb, PTR_ERR(handle));
 		/*
-- 
2.16.4



* [PATCH 07/25] ext4: Fix ext4_should_journal_data() for EA inodes
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (30 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 06/25] ext4: Fix credit estimate for final inode freeing Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 08/25] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
                   ` (19 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Similarly to directories, EA inodes make only journalled modifications
to their data. Change ext4_should_journal_data() to return true for them
so that we don't have to special-case them during truncate.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ef8fcf7d0d3b..99fe72522960 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -407,6 +407,7 @@ static inline int ext4_inode_journal_mode(struct inode *inode)
 		return EXT4_INODE_WRITEBACK_DATA_MODE;	/* writeback */
 	/* We do not support data journalling with delayed allocation */
 	if (!S_ISREG(inode->i_mode) ||
+	    ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
 	    test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA ||
 	    (ext4_test_inode_flag(inode, EXT4_INODE_JOURNAL_DATA) &&
 	    !test_opt(inode->i_sb, DELALLOC))) {
-- 
2.16.4



* [PATCH 08/25] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (31 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 07/25] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 09/25] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
                   ` (18 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Use the ext4 helper ext4_journal_extend() instead of open-coding it in
ext4_try_to_expand_extra_isize().

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 81bc2fb23c40..facc5ddb4d75 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5974,8 +5974,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	 * If this is felt to be critical, then e2fsck should be run to
 	 * force a large enough s_min_extra_isize.
 	 */
-	if (ext4_handle_valid(handle) &&
-	    jbd2_journal_extend(handle,
+	if (ext4_journal_extend(handle,
 				EXT4_DATA_TRANS_BLOCKS(inode->i_sb)) != 0)
 		return -ENOSPC;
 
-- 
2.16.4



* [PATCH 09/25] ext4: Avoid unnecessary revokes in ext4_alloc_branch()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (32 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 08/25] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 10/25] ext4: Provide function to handle transaction restarts Jan Kara
                   ` (17 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The error cleanup path in ext4_alloc_branch() calls ext4_forget() on
freshly allocated indirect blocks with 'metadata' set to 1. This
generates revoke records for these blocks. However, this is unnecessary
as the freed blocks were only allocated in the current transaction and
thus they will never be journalled. Make this cleanup path similar to
e.g. the cleanup in ext4_splice_branch() and use ext4_free_blocks() to
handle block forgetting by passing EXT4_FREE_BLOCKS_FORGET and not
EXT4_FREE_BLOCKS_METADATA to ext4_free_blocks(). This also means the
allocating transaction does not need to reserve any credits for revoke
records.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/indirect.c | 28 +++++++++++++++++++---------
 1 file changed, 19 insertions(+), 9 deletions(-)
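
A small sketch of the index mapping used by the reworked cleanup path,
under stated assumptions: struct mock_branch, MOCK_FREE_FORGET, and
mock_cleanup() are hypothetical, and the function only records which
block count and flags each new_blocks[i] would be freed with instead of
calling ext4_free_blocks().

```c
#include <stddef.h>

#define MOCK_FREE_FORGET 0x1	/* hypothetical flag value for this sketch */

struct mock_branch {
	void *bh;	/* stand-in for a buffer head; may be NULL */
};

/*
 * Walk the cleanup path starting at index i. The data blocks (index
 * indirect_blks) are freed as one run of data_len blocks without any
 * forget flag. Each indirect block new_blocks[i] is freed singly, and
 * its buffer lives at branch[i + 1].bh; the forget flag is passed only
 * when that buffer exists.
 */
static void mock_cleanup(const struct mock_branch *branch, int i,
			 int indirect_blks, int data_len,
			 int *counts, int *flags)
{
	if (i == indirect_blks) {
		counts[i] = data_len;
		flags[i] = 0;
		i--;
	}
	for (; i >= 0; i--) {
		counts[i] = 1;
		flags[i] = branch[i + 1].bh ? MOCK_FREE_FORGET : 0;
	}
}
```

Note that no metadata (revoke) flag appears anywhere in the sketch,
matching the reasoning that freshly allocated blocks need no revoke
records.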

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 36699a131168..602abae08387 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -331,11 +331,14 @@ static int ext4_alloc_branch(handle_t *handle,
 	for (i = 0; i <= indirect_blks; i++) {
 		if (i == indirect_blks) {
 			new_blocks[i] = ext4_mb_new_blocks(handle, ar, &err);
-		} else
+		} else {
 			ar->goal = new_blocks[i] = ext4_new_meta_blocks(handle,
 					ar->inode, ar->goal,
 					ar->flags & EXT4_MB_DELALLOC_RESERVED,
 					NULL, &err);
+			/* Simplify error cleanup... */
+			branch[i+1].bh = NULL;
+		}
 		if (err) {
 			i--;
 			goto failed;
@@ -377,18 +380,25 @@ static int ext4_alloc_branch(handle_t *handle,
 	}
 	return 0;
 failed:
+	if (i == indirect_blks) {
+		/* Free data blocks */
+		ext4_free_blocks(handle, ar->inode, NULL, new_blocks[i],
+				 ar->len, 0);
+		i--;
+	}
 	for (; i >= 0; i--) {
 		/*
 		 * We want to ext4_forget() only freshly allocated indirect
-		 * blocks.  Buffer for new_blocks[i-1] is at branch[i].bh and
-		 * buffer at branch[0].bh is indirect block / inode already
-		 * existing before ext4_alloc_branch() was called.
+		 * blocks. Buffer for new_blocks[i] is at branch[i+1].bh
+		 * (buffer at branch[0].bh is indirect block / inode already
+		 * existing before ext4_alloc_branch() was called). Also
+		 * because blocks are freshly allocated, we don't need to
+		 * revoke them which is why we don't set
+		 * EXT4_FREE_BLOCKS_METADATA.
 		 */
-		if (i > 0 && i != indirect_blks && branch[i].bh)
-			ext4_forget(handle, 1, ar->inode, branch[i].bh,
-				    branch[i].bh->b_blocknr);
-		ext4_free_blocks(handle, ar->inode, NULL, new_blocks[i],
-				 (i == indirect_blks) ? ar->len : 1, 0);
+		ext4_free_blocks(handle, ar->inode, branch[i+1].bh,
+				 new_blocks[i], 1,
+				 branch[i+1].bh ? EXT4_FREE_BLOCKS_FORGET : 0);
 	}
 	return err;
 }
-- 
2.16.4



* [PATCH 10/25] ext4: Provide function to handle transaction restarts
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (33 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 09/25] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 11/25] ext4, jbd2: Provide accessor function for handle credits Jan Kara
                   ` (16 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Provide ext4_journal_ensure_credits_fn() to ensure that a transaction
has a given number of credits, calling a helper function to prepare for
a transaction restart if one is needed. This allows us to remove some
boilerplate code from various places, adds proper error handling for
the case where transaction extension or restart fails, and reduces the
follow-up changes needed for proper revoke record reservation tracking.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h      |  4 ++-
 fs/ext4/ext4_jbd2.c | 11 +++++++
 fs/ext4/ext4_jbd2.h | 48 +++++++++++++++++++++++++++
 fs/ext4/extents.c   | 68 ++++++++++++++++++++++----------------
 fs/ext4/indirect.c  | 93 +++++++++++++++++++++++++++++----------------------
 fs/ext4/inode.c     | 26 ---------------
 fs/ext4/migrate.c   | 95 ++++++++++++++++++++---------------------------------
 fs/ext4/resize.c    | 46 ++++++--------------------
 fs/ext4/xattr.c     | 90 +++++++++++++++++++-------------------------------
 9 files changed, 234 insertions(+), 247 deletions(-)
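
A user-space model of the return convention (< 0 on error, 0 if the
handle has enough credits or was extended, 1 if the transaction was
restarted). All names here (struct mock_handle, mock_extend(),
mock_restart(), mock_ensure_credits(), mark_dropped()) are hypothetical.
The real helper in the diff is a macro that takes an expression rather
than a function pointer; a callback is used here only to keep the
sketch in plain C.

```c
#include <stddef.h>

/* Hypothetical model of a journal handle; not the jbd2 handle_t. */
struct mock_handle {
	int credits;		/* credits currently available */
	int extend_room;	/* credits the running transaction can still grant */
	int restarts;		/* number of restarts performed */
};

/* Returns 0 on success, 1 if the transaction is too full to extend. */
static int mock_extend(struct mock_handle *h, int nblocks)
{
	if (nblocks > h->extend_room)
		return 1;
	h->extend_room -= nblocks;
	h->credits += nblocks;
	return 0;
}

/* Restart always succeeds in this model. */
static int mock_restart(struct mock_handle *h, int nblocks)
{
	h->credits = nblocks;
	h->restarts++;
	return 0;
}

typedef int (*prepare_fn)(void *arg);

static int mock_ensure_credits(struct mock_handle *h, int check_cred,
			       int extend_cred, prepare_fn fn, void *arg)
{
	int err;

	if (h->credits >= check_cred)
		return 0;
	err = mock_extend(h, extend_cred - h->credits);
	if (err <= 0)
		return err;	/* extended, or hard error */
	if (fn) {
		err = fn(arg);	/* cleanup before the restart */
		if (err < 0)
			return err;
	}
	err = mock_restart(h, extend_cred);
	return err ? err : 1;
}

/* Example cleanup callback: records that a lock was dropped. */
static int mark_dropped(void *arg)
{
	*(int *)arg = 1;
	return 0;
}
```

A caller can treat a return of 1 as "state may have changed, revalidate"
and a negative return as fatal, which is how the converted call sites in
this patch use the helper.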

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..67a6fcc11182 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2604,7 +2604,6 @@ extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
 extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length);
-extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks);
 extern void ext4_set_inode_flags(struct inode *);
 extern int ext4_alloc_da_blocks(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
@@ -3296,6 +3295,9 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     ext4_lblk_t lblk2,  ext4_lblk_t count,
 			     int mark_unwritten,int *err);
 extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
+extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
+				       int check_cred, int restart_cred);
+
 
 /* move_extent.c */
 extern void ext4_double_down_write_data_sem(struct inode *first,
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7c70b08d104c..2b98d893cda9 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -133,6 +133,17 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 	return handle;
 }
 
+int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
+				  int extend_cred)
+{
+	if (!ext4_handle_valid(handle))
+		return 0;
+	if (handle->h_buffer_credits >= check_cred)
+		return 0;
+	return ext4_journal_extend(handle,
+				   extend_cred - handle->h_buffer_credits);
+}
+
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
 				      const char *err_fn,
 				      struct buffer_head *bh,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 99fe72522960..1920b976eef1 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -346,6 +346,54 @@ static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 	return 0;
 }
 
+int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
+				  int extend_cred);
+
+
+/*
+ * Ensure @handle has at least @check_cred credits available. If not,
+ * transaction will be extended or restarted to contain at least @extend_cred
+ * credits. Before restarting transaction @fn is executed to allow for cleanup
+ * before the transaction is restarted.
+ *
+ * The return value is < 0 in case of error, 0 in case the handle has enough
+ * credits or transaction extension succeeded, 1 in case transaction had to be
+ * restarted.
+ */
+#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred, fn) \
+({									\
+	__label__ __ensure_end;						\
+	int err = __ext4_journal_ensure_credits((handle), (check_cred),	\
+						(extend_cred));		\
+									\
+	if (err <= 0)							\
+		goto __ensure_end;					\
+	err = (fn);							\
+	if (err < 0)							\
+		goto __ensure_end;					\
+	err = ext4_journal_restart((handle), (extend_cred));		\
+	if (err == 0)							\
+		err = 1;						\
+__ensure_end:								\
+	err;								\
+})
+
+/*
+ * Ensure given handle has at least requested amount of credits available,
+ * possibly restarting transaction if needed.
+ */
+static inline int ext4_journal_ensure_credits(handle_t *handle, int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits, credits, 0);
+}
+
+static inline int ext4_journal_ensure_credits_batch(handle_t *handle,
+						    int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits,
+					      EXT4_MAX_TRANS_DATA, 0);
+}
+
 static inline int ext4_journal_blocks_per_page(struct inode *inode)
 {
 	if (EXT4_JOURNAL(inode) != NULL)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..32f2c22c7ef2 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -100,29 +100,40 @@ static int ext4_split_extent_at(handle_t *handle,
 static int ext4_find_delayed_extent(struct inode *inode,
 				    struct extent_status *newes);
 
-static int ext4_ext_truncate_extend_restart(handle_t *handle,
-					    struct inode *inode,
-					    int needed)
+static int ext4_ext_trunc_restart_fn(struct inode *inode, int *dropped)
 {
-	int err;
-
-	if (!ext4_handle_valid(handle))
-		return 0;
-	if (handle->h_buffer_credits >= needed)
-		return 0;
 	/*
-	 * If we need to extend the journal get a few extra blocks
-	 * while we're at it for efficiency's sake.
+	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
+	 * moment, get_block can be called only for blocks inside i_size since
+	 * page cache has been already dropped and writes are blocked by
+	 * i_mutex. So we can safely drop the i_data_sem here.
 	 */
-	needed += 3;
-	err = ext4_journal_extend(handle, needed - handle->h_buffer_credits);
-	if (err <= 0)
-		return err;
-	err = ext4_truncate_restart_trans(handle, inode, needed);
-	if (err == 0)
-		err = -EAGAIN;
+	BUG_ON(EXT4_JOURNAL(inode) == NULL);
+	ext4_discard_preallocations(inode);
+	up_write(&EXT4_I(inode)->i_data_sem);
+	*dropped = 1;
+	return 0;
+}
 
-	return err;
+/*
+ * Make sure 'handle' has at least 'check_cred' credits. If not, restart
+ * transaction with 'restart_cred' credits. The function drops i_data_sem
+ * when restarting transaction and gets it after transaction is restarted.
+ *
+ * The function returns 0 on success, 1 if transaction had to be restarted,
+ * and < 0 in case of fatal error.
+ */
+int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
+				int check_cred, int restart_cred)
+{
+	int ret;
+	int dropped = 0;
+
+	ret = ext4_journal_ensure_credits_fn(handle, check_cred, restart_cred,
+			ext4_ext_trunc_restart_fn(inode, &dropped));
+	if (dropped)
+		down_write(&EXT4_I(inode)->i_data_sem);
+	return ret;
 }
 
 /*
@@ -2820,9 +2831,13 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 		}
 		credits += EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
 
-		err = ext4_ext_truncate_extend_restart(handle, inode, credits);
-		if (err)
+		err = ext4_datasem_ensure_credits(handle, inode, credits,
+						  credits);
+		if (err) {
+			if (err > 0)
+				err = -EAGAIN;
 			goto out;
+		}
 
 		err = ext4_ext_get_access(handle, inode, path + depth);
 		if (err)
@@ -5206,13 +5221,10 @@ ext4_access_path(handle_t *handle, struct inode *inode,
 	 * descriptor) for each block group; assume two block
 	 * groups
 	 */
-	if (handle->h_buffer_credits < 7) {
-		credits = ext4_writepage_trans_blocks(inode);
-		err = ext4_ext_truncate_extend_restart(handle, inode, credits);
-		/* EAGAIN is success */
-		if (err && err != -EAGAIN)
-			return err;
-	}
+	credits = ext4_writepage_trans_blocks(inode);
+	err = ext4_datasem_ensure_credits(handle, inode, 7, credits);
+	if (err < 0)
+		return err;
 
 	err = ext4_ext_get_access(handle, inode, path);
 	return err;
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 602abae08387..63e1d5846442 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -699,27 +699,62 @@ int ext4_ind_trans_blocks(struct inode *inode, int nrblocks)
 	return DIV_ROUND_UP(nrblocks, EXT4_ADDR_PER_BLOCK(inode->i_sb)) + 4;
 }
 
+static int ext4_ind_trunc_restart_fn(handle_t *handle, struct inode *inode,
+				     struct buffer_head *bh, int *dropped)
+{
+	int err;
+
+	if (bh) {
+		BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
+		err = ext4_handle_dirty_metadata(handle, inode, bh);
+		if (unlikely(err))
+			return err;
+	}
+	err = ext4_mark_inode_dirty(handle, inode);
+	if (unlikely(err))
+		return err;
+	/*
+	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
+	 * moment, get_block can be called only for blocks inside i_size since
+	 * page cache has been already dropped and writes are blocked by
+	 * i_mutex. So we can safely drop the i_data_sem here.
+	 */
+	BUG_ON(EXT4_JOURNAL(inode) == NULL);
+	ext4_discard_preallocations(inode);
+	up_write(&EXT4_I(inode)->i_data_sem);
+	*dropped = 1;
+	return 0;
+}
+
 /*
  * Truncate transactions can be complex and absolutely huge.  So we need to
  * be able to restart the transaction at a conventient checkpoint to make
  * sure we don't overflow the journal.
  *
  * Try to extend this transaction for the purposes of truncation.  If
- * extend fails, we need to propagate the failure up and restart the
- * transaction in the top-level truncate loop. --sct
- *
- * Returns 0 if we managed to create more room.  If we can't create more
- * room, and the transaction must be restarted we return 1.
+ * extend fails, we restart transaction.
  */
-static int try_to_extend_transaction(handle_t *handle, struct inode *inode)
+static int ext4_ind_truncate_ensure_credits(handle_t *handle,
+					    struct inode *inode,
+					    struct buffer_head *bh)
 {
-	if (!ext4_handle_valid(handle))
-		return 0;
-	if (ext4_handle_has_enough_credits(handle, EXT4_RESERVE_TRANS_BLOCKS+1))
-		return 0;
-	if (!ext4_journal_extend(handle, ext4_blocks_for_truncate(inode)))
-		return 0;
-	return 1;
+	int ret;
+	int dropped = 0;
+
+	ret = ext4_journal_ensure_credits_fn(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_blocks_for_truncate(inode),
+			ext4_ind_trunc_restart_fn(handle, inode, bh, &dropped));
+	if (dropped)
+		down_write(&EXT4_I(inode)->i_data_sem);
+	if (ret <= 0)
+		return ret;
+	if (bh) {
+		BUFFER_TRACE(bh, "retaking write access");
+		ret = ext4_journal_get_write_access(handle, bh);
+		if (unlikely(ret))
+			return ret;
+	}
+	return 0;
 }
 
 /*
@@ -854,27 +889,9 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
 		return 1;
 	}
 
-	if (try_to_extend_transaction(handle, inode)) {
-		if (bh) {
-			BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
-			err = ext4_handle_dirty_metadata(handle, inode, bh);
-			if (unlikely(err))
-				goto out_err;
-		}
-		err = ext4_mark_inode_dirty(handle, inode);
-		if (unlikely(err))
-			goto out_err;
-		err = ext4_truncate_restart_trans(handle, inode,
-					ext4_blocks_for_truncate(inode));
-		if (unlikely(err))
-			goto out_err;
-		if (bh) {
-			BUFFER_TRACE(bh, "retaking write access");
-			err = ext4_journal_get_write_access(handle, bh);
-			if (unlikely(err))
-				goto out_err;
-		}
-	}
+	err = ext4_ind_truncate_ensure_credits(handle, inode, bh);
+	if (err < 0)
+		goto out_err;
 
 	for (p = first; p < last; p++)
 		*p = 0;
@@ -1057,11 +1074,9 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 			 */
 			if (ext4_handle_is_aborted(handle))
 				return;
-			if (try_to_extend_transaction(handle, inode)) {
-				ext4_mark_inode_dirty(handle, inode);
-				ext4_truncate_restart_trans(handle, inode,
-					    ext4_blocks_for_truncate(inode));
-			}
+			if (ext4_ind_truncate_ensure_credits(handle, inode,
+							     NULL) < 0)
+				return;
 
 			/*
 			 * The forget flag here is critical because if
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index facc5ddb4d75..e346b5171f5a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -163,32 +163,6 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
 	       (inode->i_size < EXT4_N_BLOCKS * 4);
 }
 
-/*
- * Restart the transaction associated with *handle.  This does a commit,
- * so before we call here everything must be consistently dirtied against
- * this transaction.
- */
-int ext4_truncate_restart_trans(handle_t *handle, struct inode *inode,
-				 int nblocks)
-{
-	int ret;
-
-	/*
-	 * Drop i_data_sem to avoid deadlock with ext4_map_blocks.  At this
-	 * moment, get_block can be called only for blocks inside i_size since
-	 * page cache has been already dropped and writes are blocked by
-	 * i_mutex. So we can safely drop the i_data_sem here.
-	 */
-	BUG_ON(EXT4_JOURNAL(inode) == NULL);
-	jbd_debug(2, "restarting handle %p\n", handle);
-	up_write(&EXT4_I(inode)->i_data_sem);
-	ret = ext4_journal_restart(handle, nblocks);
-	down_write(&EXT4_I(inode)->i_data_sem);
-	ext4_discard_preallocations(inode);
-
-	return ret;
-}
-
 /*
  * Called at the last iput() if i_nlink is zero.
  */
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index b1e4d359f73b..65f09dc9d941 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -50,29 +50,9 @@ static int finish_range(handle_t *handle, struct inode *inode,
 	needed = ext4_ext_calc_credits_for_single_extent(inode,
 		    lb->last_block - lb->first_block + 1, path);
 
-	/*
-	 * Make sure the credit we accumalated is not really high
-	 */
-	if (needed && ext4_handle_has_enough_credits(handle,
-						EXT4_RESERVE_TRANS_BLOCKS)) {
-		up_write((&EXT4_I(inode)->i_data_sem));
-		retval = ext4_journal_restart(handle, needed);
-		down_write((&EXT4_I(inode)->i_data_sem));
-		if (retval)
-			goto err_out;
-	} else if (needed) {
-		retval = ext4_journal_extend(handle, needed);
-		if (retval) {
-			/*
-			 * IF not able to extend the journal restart the journal
-			 */
-			up_write((&EXT4_I(inode)->i_data_sem));
-			retval = ext4_journal_restart(handle, needed);
-			down_write((&EXT4_I(inode)->i_data_sem));
-			if (retval)
-				goto err_out;
-		}
-	}
+	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed);
+	if (retval < 0)
+		goto err_out;
 	retval = ext4_ext_insert_extent(handle, inode, &path, &newext, 0);
 err_out:
 	up_write((&EXT4_I(inode)->i_data_sem));
@@ -196,26 +176,6 @@ static int update_tind_extent_range(handle_t *handle, struct inode *inode,
 
 }
 
-static int extend_credit_for_blkdel(handle_t *handle, struct inode *inode)
-{
-	int retval = 0, needed;
-
-	if (ext4_handle_has_enough_credits(handle, EXT4_RESERVE_TRANS_BLOCKS+1))
-		return 0;
-	/*
-	 * We are freeing a blocks. During this we touch
-	 * superblock, group descriptor and block bitmap.
-	 * So allocate a credit of 3. We may update
-	 * quota (user and group).
-	 */
-	needed = 3 + EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
-
-	if (ext4_journal_extend(handle, needed) != 0)
-		retval = ext4_journal_restart(handle, needed);
-
-	return retval;
-}
-
 static int free_dind_blocks(handle_t *handle,
 				struct inode *inode, __le32 i_data)
 {
@@ -223,6 +183,7 @@ static int free_dind_blocks(handle_t *handle,
 	__le32 *tmp_idata;
 	struct buffer_head *bh;
 	unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
+	int err;
 
 	bh = ext4_sb_bread(inode->i_sb, le32_to_cpu(i_data), 0);
 	if (IS_ERR(bh))
@@ -231,7 +192,12 @@ static int free_dind_blocks(handle_t *handle,
 	tmp_idata = (__le32 *)bh->b_data;
 	for (i = 0; i < max_entries; i++) {
 		if (tmp_idata[i]) {
-			extend_credit_for_blkdel(handle, inode);
+			err = ext4_journal_ensure_credits(handle,
+						EXT4_RESERVE_TRANS_BLOCKS);
+			if (err < 0) {
+				put_bh(bh);
+				return err;
+			}
 			ext4_free_blocks(handle, inode, NULL,
 					 le32_to_cpu(tmp_idata[i]), 1,
 					 EXT4_FREE_BLOCKS_METADATA |
@@ -239,7 +205,9 @@ static int free_dind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (err < 0)
+		return err;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
 			 EXT4_FREE_BLOCKS_METADATA |
 			 EXT4_FREE_BLOCKS_FORGET);
@@ -270,7 +238,9 @@ static int free_tind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (retval < 0)
+		return retval;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
 			 EXT4_FREE_BLOCKS_METADATA |
 			 EXT4_FREE_BLOCKS_FORGET);
@@ -283,7 +253,10 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
 
 	/* ei->i_data[EXT4_IND_BLOCK] */
 	if (i_data[0]) {
-		extend_credit_for_blkdel(handle, inode);
+		retval = ext4_journal_ensure_credits(handle,
+						     EXT4_RESERVE_TRANS_BLOCKS);
+		if (retval < 0)
+			return retval;
 		ext4_free_blocks(handle, inode, NULL,
 				le32_to_cpu(i_data[0]), 1,
 				 EXT4_FREE_BLOCKS_METADATA |
@@ -318,12 +291,9 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
 	 * One credit accounted for writing the
 	 * i_data field of the original inode
 	 */
-	retval = ext4_journal_extend(handle, 1);
-	if (retval) {
-		retval = ext4_journal_restart(handle, 1);
-		if (retval)
-			goto err_out;
-	}
+	retval = ext4_journal_ensure_credits(handle, 1);
+	if (retval < 0)
+		goto err_out;
 
 	i_data[0] = ei->i_data[EXT4_IND_BLOCK];
 	i_data[1] = ei->i_data[EXT4_DIND_BLOCK];
@@ -391,15 +361,19 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
 		ix = EXT_FIRST_INDEX(eh);
 		for (i = 0; i < le16_to_cpu(eh->eh_entries); i++, ix++) {
 			retval = free_ext_idx(handle, inode, ix);
-			if (retval)
-				break;
+			if (retval) {
+				put_bh(bh);
+				return retval;
+			}
 		}
 	}
 	put_bh(bh);
-	extend_credit_for_blkdel(handle, inode);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	if (retval < 0)
+		return retval;
 	ext4_free_blocks(handle, inode, NULL, block, 1,
 			 EXT4_FREE_BLOCKS_METADATA | EXT4_FREE_BLOCKS_FORGET);
-	return retval;
+	return 0;
 }
 
 /*
@@ -574,9 +548,9 @@ int ext4_ext_migrate(struct inode *inode)
 	}
 
 	/* We mark the tmp_inode dirty via ext4_ext_tree_init. */
-	if (ext4_journal_extend(handle, 1) != 0)
-		ext4_journal_restart(handle, 1);
-
+	retval = ext4_journal_ensure_credits(handle, 1);
+	if (retval < 0)
+		goto out_stop;
 	/*
 	 * Mark the tmp_inode as of size zero
 	 */
@@ -594,6 +568,7 @@ int ext4_ext_migrate(struct inode *inode)
 
 	/* Reset the extent details */
 	ext4_ext_tree_init(handle, tmp_inode);
+out_stop:
 	ext4_journal_stop(handle);
 out:
 	unlock_new_inode(tmp_inode);
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index c0e9aef376a7..3e4286b3901f 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -388,30 +388,6 @@ static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
 	return bh;
 }
 
-/*
- * If we have fewer than thresh credits, extend by EXT4_MAX_TRANS_DATA.
- * If that fails, restart the transaction & regain write access for the
- * buffer head which is used for block_bitmap modifications.
- */
-static int extend_or_restart_transaction(handle_t *handle, int thresh)
-{
-	int err;
-
-	if (ext4_handle_has_enough_credits(handle, thresh))
-		return 0;
-
-	err = ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA);
-	if (err < 0)
-		return err;
-	if (err) {
-		err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA);
-		if (err)
-			return err;
-	}
-
-	return 0;
-}
-
 /*
  * set_flexbg_block_bitmap() mark clusters [@first_cluster, @last_cluster] used.
  *
@@ -451,8 +427,8 @@ static int set_flexbg_block_bitmap(struct super_block *sb, handle_t *handle,
 			continue;
 		}
 
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			return err;
 
 		bh = sb_getblk(sb, flex_gd->groups[group].block_bitmap);
@@ -544,8 +520,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 			struct buffer_head *gdb;
 
 			ext4_debug("update backup group %#04llx\n", block);
-			err = extend_or_restart_transaction(handle, 1);
-			if (err)
+			err = ext4_journal_ensure_credits_batch(handle, 1);
+			if (err < 0)
 				goto out;
 
 			gdb = sb_getblk(sb, block);
@@ -602,8 +578,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize block bitmap of the @group */
 		block = group_data[i].block_bitmap;
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			goto out;
 
 		bh = bclean(handle, sb, block);
@@ -631,8 +607,8 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize inode bitmap of the @group */
 		block = group_data[i].inode_bitmap;
-		err = extend_or_restart_transaction(handle, 1);
-		if (err)
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			goto out;
 		/* Mark unused entries in inode bitmap used */
 		bh = bclean(handle, sb, block);
@@ -1109,10 +1085,8 @@ static void update_backups(struct super_block *sb, sector_t blk_off, char *data,
 		ext4_fsblk_t backup_block;
 
 		/* Out of journal space, and can't get more - abort - so sad */
-		if (ext4_handle_valid(handle) &&
-		    handle->h_buffer_credits == 0 &&
-		    ext4_journal_extend(handle, EXT4_MAX_TRANS_DATA) &&
-		    (err = ext4_journal_restart(handle, EXT4_MAX_TRANS_DATA)))
+		err = ext4_journal_ensure_credits_batch(handle, 1);
+		if (err < 0)
 			break;
 
 		if (meta_bg == 0)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..b79d8ffd3e9b 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -967,55 +967,6 @@ int __ext4_xattr_set_credits(struct super_block *sb, struct inode *inode,
 	return credits;
 }
 
-static int ext4_xattr_ensure_credits(handle_t *handle, struct inode *inode,
-				     int credits, struct buffer_head *bh,
-				     bool dirty, bool block_csum)
-{
-	int error;
-
-	if (!ext4_handle_valid(handle))
-		return 0;
-
-	if (handle->h_buffer_credits >= credits)
-		return 0;
-
-	error = ext4_journal_extend(handle, credits - handle->h_buffer_credits);
-	if (!error)
-		return 0;
-	if (error < 0) {
-		ext4_warning(inode->i_sb, "Extend journal (error %d)", error);
-		return error;
-	}
-
-	if (bh && dirty) {
-		if (block_csum)
-			ext4_xattr_block_csum_set(inode, bh);
-		error = ext4_handle_dirty_metadata(handle, NULL, bh);
-		if (error) {
-			ext4_warning(inode->i_sb, "Handle metadata (error %d)",
-				     error);
-			return error;
-		}
-	}
-
-	error = ext4_journal_restart(handle, credits);
-	if (error) {
-		ext4_warning(inode->i_sb, "Restart journal (error %d)", error);
-		return error;
-	}
-
-	if (bh) {
-		error = ext4_journal_get_write_access(handle, bh);
-		if (error) {
-			ext4_warning(inode->i_sb,
-				     "Get write access failed (error %d)",
-				     error);
-			return error;
-		}
-	}
-	return 0;
-}
-
 static int ext4_xattr_inode_update_ref(handle_t *handle, struct inode *ea_inode,
 				       int ref_change)
 {
@@ -1149,6 +1100,24 @@ static int ext4_xattr_inode_inc_ref_all(handle_t *handle, struct inode *parent,
 	return saved_err;
 }
 
+static int ext4_xattr_restart_fn(handle_t *handle, struct inode *inode,
+			struct buffer_head *bh, bool block_csum, bool dirty)
+{
+	int error;
+
+	if (bh && dirty) {
+		if (block_csum)
+			ext4_xattr_block_csum_set(inode, bh);
+		error = ext4_handle_dirty_metadata(handle, NULL, bh);
+		if (error) {
+			ext4_warning(inode->i_sb, "Handle metadata (error %d)",
+				     error);
+			return error;
+		}
+	}
+	return 0;
+}
+
 static void
 ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			     struct buffer_head *bh,
@@ -1185,13 +1154,23 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 			continue;
 		}
 
-		err = ext4_xattr_ensure_credits(handle, parent, credits, bh,
-						dirty, block_csum);
-		if (err) {
+		err = ext4_journal_ensure_credits_fn(handle, credits, credits,
+			ext4_xattr_restart_fn(handle, parent, bh, block_csum,
+					      dirty));
+		if (err < 0) {
 			ext4_warning_inode(ea_inode, "Ensure credits err=%d",
 					   err);
 			continue;
 		}
+		if (err > 0) {
+			err = ext4_journal_get_write_access(handle, bh);
+			if (err) {
+				ext4_warning_inode(ea_inode,
+						"Re-get write access err=%d",
+						err);
+				continue;
+			}
+		}
 
 		err = ext4_xattr_inode_dec_ref(handle, ea_inode);
 		if (err) {
@@ -2862,11 +2841,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	struct inode *ea_inode;
 	int error;
 
-	error = ext4_xattr_ensure_credits(handle, inode, extra_credits,
-					  NULL /* bh */,
-					  false /* dirty */,
-					  false /* block_csum */);
-	if (error) {
+	error = ext4_journal_ensure_credits(handle, extra_credits);
+	if (error < 0) {
 		EXT4_ERROR_INODE(inode, "ensure credits (error %d)", error);
 		goto cleanup;
 	}
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 11/25] ext4, jbd2: Provide accessor function for handle credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (34 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 10/25] ext4: Provide function to handle transaction restarts Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 12/25] ocfs2: Use accessor function for h_buffer_credits Jan Kara
                   ` (15 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Provide an accessor function to get the number of credits available in a
handle and use it from ext4. Later, computing the available credits won't
be so straightforward.
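
To illustrate the idea outside the kernel, here is a minimal userspace
sketch. Everything named demo_* is hypothetical and only models the real
handle_t; the check mirrors the shape of __ext4_journal_ensure_credits()
in the diff below, under the assumption that callers only need the number
of missing credits.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's handle_t. */
struct demo_handle {
	int buffer_credits;	/* models handle->h_buffer_credits */
	bool valid;		/* models ext4_handle_valid(handle) */
};

/* The one place that knows how "available credits" is derived. */
static inline int demo_handle_buffer_credits(const struct demo_handle *h)
{
	return h->buffer_credits;
}

/*
 * Return 0 when the handle is not journalled or already holds at least
 * check_cred credits; otherwise return how many credits are missing to
 * reach extend_cred.
 */
static int demo_credits_to_extend(const struct demo_handle *h,
				  int check_cred, int extend_cred)
{
	if (!h->valid)
		return 0;
	if (demo_handle_buffer_credits(h) >= check_cred)
		return 0;
	return extend_cred - demo_handle_buffer_credits(h);
}
```

Since callers go through demo_handle_buffer_credits() only, a later change
to how the value is computed stays confined to that one function.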

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c  | 13 +++++++------
 fs/ext4/ext4_jbd2.h  |  7 -------
 fs/ext4/xattr.c      |  2 +-
 include/linux/jbd2.h |  6 ++++++
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 2b98d893cda9..731bbfdbce5b 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -119,8 +119,8 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 		return ext4_get_nojournal();
 
 	sb = handle->h_journal->j_private;
-	trace_ext4_journal_start_reserved(sb, handle->h_buffer_credits,
-					  _RET_IP_);
+	trace_ext4_journal_start_reserved(sb,
+				jbd2_handle_buffer_credits(handle), _RET_IP_);
 	err = ext4_journal_check_start(sb);
 	if (err < 0) {
 		jbd2_journal_free_reserved(handle);
@@ -138,10 +138,10 @@ int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
 {
 	if (!ext4_handle_valid(handle))
 		return 0;
-	if (handle->h_buffer_credits >= check_cred)
+	if (jbd2_handle_buffer_credits(handle) >= check_cred)
 		return 0;
 	return ext4_journal_extend(handle,
-				   extend_cred - handle->h_buffer_credits);
+			   extend_cred - jbd2_handle_buffer_credits(handle));
 }
 
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
@@ -289,7 +289,7 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 				       handle->h_type,
 				       handle->h_line_no,
 				       handle->h_requested_credits,
-				       handle->h_buffer_credits, err);
+				       jbd2_handle_buffer_credits(handle), err);
 				return err;
 			}
 			ext4_error_inode(inode, where, line,
@@ -300,7 +300,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
 					 handle->h_type,
 					 handle->h_line_no,
 					 handle->h_requested_credits,
-					 handle->h_buffer_credits, err);
+					 jbd2_handle_buffer_credits(handle),
+					 err);
 		}
 	} else {
 		if (inode)
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 1920b976eef1..36aa72599646 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -288,13 +288,6 @@ static inline int ext4_handle_is_aborted(handle_t *handle)
 	return 0;
 }
 
-static inline int ext4_handle_has_enough_credits(handle_t *handle, int needed)
-{
-	if (ext4_handle_valid(handle) && handle->h_buffer_credits < needed)
-		return 0;
-	return 1;
-}
-
 #define ext4_journal_start_sb(sb, type, nblocks)			\
 	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0)
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index b79d8ffd3e9b..48a9dbd27f43 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2314,7 +2314,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 						   flags & XATTR_CREATE);
 		brelse(bh);
 
-		if (!ext4_handle_has_enough_credits(handle, credits)) {
+		if (jbd2_handle_buffer_credits(handle) < credits) {
 			error = -ENOSPC;
 			goto cleanup;
 		}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 10e6049c0ba9..727ff91d7f3e 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1645,6 +1645,12 @@ static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
 	return tid;
 }
 
+
+static inline int jbd2_handle_buffer_credits(handle_t *handle)
+{
+	return handle->h_buffer_credits;
+}
+
 #ifdef __KERNEL__
 
 #define buffer_trace_init(bh)	do {} while (0)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 12/25] ocfs2: Use accessor function for h_buffer_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (35 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 11/25] ext4, jbd2: Provide accessor function for handle credits Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 13/25] jbd2: Fix statistics for the number of logged blocks Jan Kara
                   ` (14 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Use the jbd2 accessor function for h_buffer_credits.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/alloc.c   | 32 ++++++++++++++++----------------
 fs/ocfs2/journal.c |  4 ++--
 2 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index f9baefc76cf9..88534eb0e7c2 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -2288,9 +2288,9 @@ static int ocfs2_extend_rotate_transaction(handle_t *handle, int subtree_depth,
 	int ret = 0;
 	int credits = (path->p_tree_depth - subtree_depth) * 2 + 1 + op_credits;
 
-	if (handle->h_buffer_credits < credits)
+	if (jbd2_handle_buffer_credits(handle) < credits)
 		ret = ocfs2_extend_trans(handle,
-					 credits - handle->h_buffer_credits);
+				credits - jbd2_handle_buffer_credits(handle));
 
 	return ret;
 }
@@ -2367,7 +2367,7 @@ static int ocfs2_rotate_tree_right(handle_t *handle,
 				   struct ocfs2_path *right_path,
 				   struct ocfs2_path **ret_left_path)
 {
-	int ret, start, orig_credits = handle->h_buffer_credits;
+	int ret, start, orig_credits = jbd2_handle_buffer_credits(handle);
 	u32 cpos;
 	struct ocfs2_path *left_path = NULL;
 	struct super_block *sb = ocfs2_metadata_cache_get_super(et->et_ci);
@@ -3148,7 +3148,7 @@ static int ocfs2_rotate_tree_left(handle_t *handle,
 				  struct ocfs2_path *path,
 				  struct ocfs2_cached_dealloc_ctxt *dealloc)
 {
-	int ret, orig_credits = handle->h_buffer_credits;
+	int ret, orig_credits = jbd2_handle_buffer_credits(handle);
 	struct ocfs2_path *tmp_path = NULL, *restart_path = NULL;
 	struct ocfs2_extent_block *eb;
 	struct ocfs2_extent_list *el;
@@ -3386,8 +3386,8 @@ static int ocfs2_merge_rec_right(struct ocfs2_path *left_path,
 							right_path);
 
 		ret = ocfs2_extend_rotate_transaction(handle, subtree_index,
-						      handle->h_buffer_credits,
-						      right_path);
+					jbd2_handle_buffer_credits(handle),
+					right_path);
 		if (ret) {
 			mlog_errno(ret);
 			goto out;
@@ -3548,8 +3548,8 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 							right_path);
 
 		ret = ocfs2_extend_rotate_transaction(handle, subtree_index,
-						      handle->h_buffer_credits,
-						      left_path);
+					jbd2_handle_buffer_credits(handle),
+					left_path);
 		if (ret) {
 			mlog_errno(ret);
 			goto out;
@@ -3623,7 +3623,7 @@ static int ocfs2_merge_rec_left(struct ocfs2_path *right_path,
 		    le16_to_cpu(el->l_next_free_rec) == 1) {
 			/* extend credit for ocfs2_remove_rightmost_path */
 			ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					right_path);
 			if (ret) {
 				mlog_errno(ret);
@@ -3669,7 +3669,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 	if (ctxt->c_split_covers_rec && ctxt->c_has_empty_extent) {
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3725,7 +3725,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3755,7 +3755,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -3799,7 +3799,7 @@ static int ocfs2_try_to_merge_extent(handle_t *handle,
 		if (ctxt->c_split_covers_rec) {
 			/* extend credit for ocfs2_remove_rightmost_path */
 			ret = ocfs2_extend_rotate_transaction(handle, 0,
-					handle->h_buffer_credits,
+					jbd2_handle_buffer_credits(handle),
 					path);
 			if (ret) {
 				mlog_errno(ret);
@@ -5358,7 +5358,7 @@ static int ocfs2_truncate_rec(handle_t *handle,
 	if (ocfs2_is_empty_extent(&el->l_recs[0]) && index > 0) {
 		/* extend credit for ocfs2_remove_rightmost_path */
 		ret = ocfs2_extend_rotate_transaction(handle, 0,
-				handle->h_buffer_credits,
+				jbd2_handle_buffer_credits(handle),
 				path);
 		if (ret) {
 			mlog_errno(ret);
@@ -5427,8 +5427,8 @@ static int ocfs2_truncate_rec(handle_t *handle,
 	}
 
 	ret = ocfs2_extend_rotate_transaction(handle, 0,
-					      handle->h_buffer_credits,
-					      path);
+					jbd2_handle_buffer_credits(handle),
+					path);
 	if (ret) {
 		mlog_errno(ret);
 		goto out;
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 930e3d388579..019aaf2a3f8a 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -419,7 +419,7 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
 	if (!nblocks)
 		return 0;
 
-	old_nblocks = handle->h_buffer_credits;
+	old_nblocks = jbd2_handle_buffer_credits(handle);
 
 	trace_ocfs2_extend_trans(old_nblocks, nblocks);
 
@@ -460,7 +460,7 @@ int ocfs2_allocate_extend_trans(handle_t *handle, int thresh)
 
 	BUG_ON(!handle);
 
-	old_nblks = handle->h_buffer_credits;
+	old_nblks = jbd2_handle_buffer_credits(handle);
 	trace_ocfs2_allocate_extend_trans(old_nblks, thresh);
 
 	if (old_nblks < thresh)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 13/25] jbd2: Fix statistics for the number of logged blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (36 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 12/25] ocfs2: Use accessor function for h_buffer_credits Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 14/25] jbd2: Reorganize jbd2_journal_stop() Jan Kara
                   ` (13 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The jbd2 statistic counting the number of blocks logged in a transaction
was wrong. It didn't count the commit block and, more importantly, it
didn't count revoke descriptor blocks. Make sure these get properly counted.
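
As a rough model of the corrected accounting (not the kernel code itself),
the sketch below counts one block per buffer as its writeout completes and
adds the commit block at the end. The stat_* names are hypothetical, and
the split into "metadata" and "control" buffers is an assumption made for
illustration, with control buffers standing for descriptor blocks
including those carrying revoke records.

```c
/* Hypothetical per-commit buffer counts; names are illustrative only. */
struct stat_commit_counts {
	unsigned int metadata_bufs;	/* journalled metadata blocks */
	unsigned int control_bufs;	/* descriptor blocks, revoke ones included */
};

/* Count every block written for one commit, including the commit block. */
static unsigned int stat_blocks_logged(const struct stat_commit_counts *c)
{
	unsigned int logged = 0;
	unsigned int i;

	/* One increment per metadata buffer once its writeout is done. */
	for (i = 0; i < c->metadata_bufs; i++)
		logged++;

	/* One increment per control buffer once its writeout is done. */
	for (i = 0; i < c->control_bufs; i++)
		logged++;

	/* The commit block itself is a logged block as well. */
	logged++;

	return logged;
}
```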

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c6d39f2ad828..b67e2d0cff88 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -726,7 +726,6 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 				submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
 			}
 			cond_resched();
-			stats.run.rs_blocks_logged += bufs;
 
 			/* Force a new descriptor to be generated next
                            time round the loop. */
@@ -813,6 +812,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		if (unlikely(!buffer_uptodate(bh)))
 			err = -EIO;
 		jbd2_unfile_log_bh(bh);
+		stats.run.rs_blocks_logged++;
 
 		/*
 		 * The list contains temporary buffer heads created by
@@ -858,6 +858,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		BUFFER_TRACE(bh, "ph5: control buffer writeout done: unfile");
 		clear_buffer_jwrite(bh);
 		jbd2_unfile_log_bh(bh);
+		stats.run.rs_blocks_logged++;
 		__brelse(bh);		/* One for getblk */
 		/* AKPM: bforget here */
 	}
@@ -879,6 +880,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	}
 	if (cbh)
 		err = journal_wait_on_commit_record(journal, cbh);
+	stats.run.rs_blocks_logged++;
 	if (jbd2_has_feature_async_commit(journal) &&
 	    journal->j_flags & JBD2_BARRIER) {
 		blkdev_issue_flush(journal->j_dev, GFP_NOFS, NULL);
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 14/25] jbd2: Reorganize jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (37 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 13/25] jbd2: Fix statistics for the number of logged blocks Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 15/25] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
                   ` (12 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Move code in jbd2_journal_stop() around a bit. It removes some
unnecessary code duplication and will make factoring out parts common
with jbd2__journal_restart() easier.
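
The resulting control flow can be sketched in userspace as follows. This
is a simplified model, not the kernel function: the nest_* types are
hypothetical, freeing is represented by a flag, and all statistics and
commit logic are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for transaction_t and handle_t. */
struct nest_transaction {
	int updates;			/* models t_updates */
};

struct nest_handle {
	int ref;			/* models h_ref */
	bool aborted;			/* models is_handle_aborted() */
	struct nest_transaction *transaction;	/* NULL once detached */
	bool freed;			/* set instead of really freeing */
};

static int nest_journal_stop(struct nest_handle *h)
{
	int err = 0;

	/* Nested stop: only drop the reference, whatever the state. */
	if (--h->ref > 0)
		return h->aborted ? -EIO : 0;

	/* Already detached: nothing left to do but free the handle. */
	if (!h->transaction) {
		h->freed = true;
		return 0;
	}

	if (h->aborted)
		err = -EIO;

	/* Last reference: detach from the running transaction. */
	h->transaction->updates--;
	h->freed = true;
	return err;
}
```

With the refcount check done once at the top, the detached and attached
cases no longer each need their own copy of it.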

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index bee8498d7792..6f560713f7f0 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1706,41 +1706,34 @@ int jbd2_journal_stop(handle_t *handle)
 	tid_t tid;
 	pid_t pid;
 
+	if (--handle->h_ref > 0) {
+		jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
+						 handle->h_ref);
+		if (is_handle_aborted(handle))
+			return -EIO;
+		return 0;
+	}
 	if (!transaction) {
 		/*
-		 * Handle is already detached from the transaction so
-		 * there is nothing to do other than decrease a refcount,
-		 * or free the handle if refcount drops to zero
+		 * Handle is already detached from the transaction so there is
+		 * nothing to do other than free the handle.
 		 */
-		if (--handle->h_ref > 0) {
-			jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
-							 handle->h_ref);
-			return err;
-		} else {
-			if (handle->h_rsv_handle)
-				jbd2_free_handle(handle->h_rsv_handle);
-			goto free_and_exit;
-		}
+		if (handle->h_rsv_handle)
+			jbd2_free_handle(handle->h_rsv_handle);
+		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
+	tid = transaction->t_tid;
 
 	J_ASSERT(journal_current_handle() == handle);
+	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
 
 	if (is_handle_aborted(handle))
 		err = -EIO;
-	else
-		J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-
-	if (--handle->h_ref > 0) {
-		jbd_debug(4, "h_ref %d -> %d\n", handle->h_ref + 1,
-			  handle->h_ref);
-		return err;
-	}
 
 	jbd_debug(4, "Handle %p going down\n", handle);
 	trace_jbd2_handle_stats(journal->j_fs_dev->bd_dev,
-				transaction->t_tid,
-				handle->h_type, handle->h_line_no,
+				tid, handle->h_type, handle->h_line_no,
 				jiffies - handle->h_start_jiffies,
 				handle->h_sync, handle->h_requested_credits,
 				(handle->h_requested_credits -
@@ -1825,7 +1818,7 @@ int jbd2_journal_stop(handle_t *handle)
 		jbd_debug(2, "transaction too old, requesting commit for "
 					"handle %p\n", handle);
 		/* This is non-blocking */
-		jbd2_log_start_commit(journal, transaction->t_tid);
+		jbd2_log_start_commit(journal, tid);
 
 		/*
 		 * Special case: JBD2_SYNC synchronous updates require us
@@ -1841,7 +1834,6 @@ int jbd2_journal_stop(handle_t *handle)
 	 * once we do this, we must not dereference transaction
 	 * pointer again.
 	 */
-	tid = transaction->t_tid;
 	if (atomic_dec_and_test(&transaction->t_updates)) {
 		wake_up(&journal->j_wait_updates);
 		if (journal->j_barrier_count)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 15/25] jbd2: Drop pointless check from jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (38 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 14/25] jbd2: Reorganize jbd2_journal_stop() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 16/25] jbd2: Drop pointless wakeup " Jan Kara
                   ` (11 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

If a transaction is larger than journal->j_max_transaction_buffers, that
is a bug and not a trigger for transaction commit. Also, the very next
attempt to start a new handle will start transaction commit anyway. So
just remove the pointless check. Arguably, we could start transaction
commit whenever the transaction size is *close* to
journal->j_max_transaction_buffers. This has the potential to reduce the
latency of the next jbd2_journal_start() at the cost of somewhat smaller
transactions. However, for this to have any effect, nobody can already be
waiting in jbd2_journal_start(), which means the metadata load on the fs
is pretty light anyway, so this optimization is probably not worth it.
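
For reference, the condition that remains after this patch can be modelled
in userspace as below. The tick_* helpers are hypothetical; the comparison
follows the same wraparound-safe idea as the kernel's time_after_eq(),
assuming the usual two's-complement conversion from unsigned long to long.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is at or after b" for a free-running tick counter,
 * the same idea as the kernel's time_after_eq().
 */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
	return (long)(a - b) >= 0;
}

/*
 * Model of the remaining trigger: request a commit for a synchronous
 * handle or once the transaction has reached its expiry time.
 */
static bool tick_should_start_commit(bool h_sync, unsigned long now,
				     unsigned long expires)
{
	return h_sync || tick_after_eq(now, expires);
}
```

Note that transaction size no longer appears in the condition at all.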

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6f560713f7f0..a160c3f665f9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1803,13 +1803,10 @@ int jbd2_journal_stop(handle_t *handle)
 
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
-	 * going!  We also want to force a commit if the current
-	 * transaction is occupying too much of the log, or if the
-	 * transaction is too old now.
+	 * going!  We also want to force a commit if the transaction is too
+	 * old now.
 	 */
 	if (handle->h_sync ||
-	    (atomic_read(&transaction->t_outstanding_credits) >
-	     journal->j_max_transaction_buffers) ||
 	    time_after_eq(jiffies, transaction->t_expires)) {
 		/* Do this even for aborted journals: an abort still
 		 * completes the commit thread, it just doesn't write
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 16/25] jbd2: Drop pointless wakeup from jbd2_journal_stop()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (39 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 15/25] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 17/25] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
                   ` (10 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

When we drop the last handle from a transaction and
journal->j_barrier_count > 0, jbd2_journal_stop() wakes up the
journal->j_wait_transaction_locked wait queue. This looks pointless:
waiting for outstanding handles always happens on the
journal->j_wait_updates wait queue. journal->j_wait_transaction_locked
is used to wait for transaction state changes and by start_this_handle()
for waiting until journal->j_barrier_count drops to 0. The first case is
clearly irrelevant here since only the jbd2 thread changes transaction
state. The second case looks related, but jbd2_journal_unlock_updates()
is responsible for the wakeup in that case. So just drop the wakeup.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a160c3f665f9..d648cec3f90f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1831,11 +1831,8 @@ int jbd2_journal_stop(handle_t *handle)
 	 * once we do this, we must not dereference transaction
 	 * pointer again.
 	 */
-	if (atomic_dec_and_test(&transaction->t_updates)) {
+	if (atomic_dec_and_test(&transaction->t_updates))
 		wake_up(&journal->j_wait_updates);
-		if (journal->j_barrier_count)
-			wake_up(&journal->j_wait_transaction_locked);
-	}
 
 	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
 
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 17/25] jbd2: Factor out common parts of stopping and restarting a handle
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (40 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 16/25] jbd2: Drop pointless wakeup " Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 18/25] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
                   ` (9 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

jbd2__journal_restart() shares quite a lot of code with
jbd2_journal_stop(). Factor this functionality into a stop_this_handle()
helper and use it from both functions. Note that this also drops
t_handle_lock protection from jbd2__journal_restart(), as
jbd2_journal_stop() does the same thing without it.
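
The shape of the refactoring can be sketched in userspace with C11
atomics. This is only a model under stated assumptions: the shared_* names
are hypothetical, the wakeup is represented by a counter, and the restart
path attaches directly to a caller-supplied next transaction instead of
going through start_this_handle().

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for transaction_t and handle_t. */
struct shared_transaction {
	atomic_int outstanding_credits;	/* models t_outstanding_credits */
	atomic_int updates;		/* models t_updates */
	int wakeups;			/* counts modelled wake_up() calls */
};

struct shared_handle {
	struct shared_transaction *transaction;
	int buffer_credits;
};

/* Common part: give back unused credits and drop the update count. */
static void shared_stop_this_handle(struct shared_handle *h)
{
	struct shared_transaction *t = h->transaction;

	atomic_fetch_sub(&t->outstanding_credits, h->buffer_credits);
	/* atomic_fetch_sub() returns the old value: 1 means we were last. */
	if (atomic_fetch_sub(&t->updates, 1) == 1)
		t->wakeups++;
}

static void shared_journal_stop(struct shared_handle *h)
{
	shared_stop_this_handle(h);
	h->transaction = NULL;
}

static void shared_journal_restart(struct shared_handle *h,
				   struct shared_transaction *next,
				   int nblocks)
{
	shared_stop_this_handle(h);
	/* Re-attach with fresh credits; the real code may block here. */
	h->transaction = next;
	h->buffer_credits = nblocks;
	atomic_fetch_add(&next->outstanding_credits, nblocks);
	atomic_fetch_add(&next->updates, 1);
}
```

Both entry points rely on atomic operations alone in the shared helper,
which is the property the commit message points to when it drops the
extra lock from the restart path.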

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 98 ++++++++++++++++++++++++---------------------------
 1 file changed, 46 insertions(+), 52 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index d648cec3f90f..b30df011beaa 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -512,12 +512,17 @@ handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
 }
 EXPORT_SYMBOL(jbd2_journal_start);
 
-void jbd2_journal_free_reserved(handle_t *handle)
+static void __jbd2_journal_unreserve_handle(handle_t *handle)
 {
 	journal_t *journal = handle->h_journal;
 
 	WARN_ON(!handle->h_reserved);
 	sub_reserved_credits(journal, handle->h_buffer_credits);
+}
+
+void jbd2_journal_free_reserved(handle_t *handle)
+{
+	__jbd2_journal_unreserve_handle(handle);
 	jbd2_free_handle(handle);
 }
 EXPORT_SYMBOL(jbd2_journal_free_reserved);
@@ -655,6 +660,28 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 	return result;
 }
 
+static void stop_this_handle(handle_t *handle)
+{
+	transaction_t *transaction = handle->h_transaction;
+	journal_t *journal = transaction->t_journal;
+
+	J_ASSERT(journal_current_handle() == handle);
+	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
+	current->journal_info = NULL;
+	atomic_sub(handle->h_buffer_credits,
+		   &transaction->t_outstanding_credits);
+	if (handle->h_rsv_handle)
+		__jbd2_journal_unreserve_handle(handle->h_rsv_handle);
+	if (atomic_dec_and_test(&transaction->t_updates))
+		wake_up(&journal->j_wait_updates);
+
+	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
+	/*
+	 * Scope of the GFP_NOFS context is over here and so we can restore the
+	 * original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
+}
 
 /**
  * int jbd2_journal_restart() - restart a handle .
@@ -677,52 +704,34 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
 	tid_t		tid;
-	int		need_to_start, ret;
+	int		need_to_start;
 
 	/* If we've had an abort of any type, don't even think about
 	 * actually doing the restart! */
 	if (is_handle_aborted(handle))
 		return 0;
 	journal = transaction->t_journal;
+	tid = transaction->t_tid;
 
 	/*
 	 * First unlink the handle from its current transaction, and start the
 	 * commit on that.
 	 */
-	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-	J_ASSERT(journal_current_handle() == handle);
-
-	read_lock(&journal->j_state_lock);
-	spin_lock(&transaction->t_handle_lock);
-	atomic_sub(handle->h_buffer_credits,
-		   &transaction->t_outstanding_credits);
-	if (handle->h_rsv_handle) {
-		sub_reserved_credits(journal,
-				     handle->h_rsv_handle->h_buffer_credits);
-	}
-	if (atomic_dec_and_test(&transaction->t_updates))
-		wake_up(&journal->j_wait_updates);
-	tid = transaction->t_tid;
-	spin_unlock(&transaction->t_handle_lock);
+	jbd_debug(2, "restarting handle %p\n", handle);
+	stop_this_handle(handle);
 	handle->h_transaction = NULL;
-	current->journal_info = NULL;
 
-	jbd_debug(2, "restarting handle %p\n", handle);
+	/*
+	 * TODO: If we use READ_ONCE / WRITE_ONCE for j_commit_request we can
+ 	 * get rid of pointless j_state_lock traffic like this.
+	 */
+	read_lock(&journal->j_state_lock);
 	need_to_start = !tid_geq(journal->j_commit_request, tid);
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-
-	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
 	handle->h_buffer_credits = nblocks;
-	/*
-	 * Restore the original nofs context because the journal restart
-	 * is basically the same thing as journal stop and start.
-	 * start_this_handle will start a new nofs context.
-	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
-	ret = start_this_handle(journal, handle, gfp_mask);
-	return ret;
+	return start_this_handle(journal, handle, gfp_mask);
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
 
@@ -1718,16 +1727,12 @@ int jbd2_journal_stop(handle_t *handle)
 		 * Handle is already detached from the transaction so there is
 		 * nothing to do other than free the handle.
 		 */
-		if (handle->h_rsv_handle)
-			jbd2_free_handle(handle->h_rsv_handle);
+		memalloc_nofs_restore(handle->saved_alloc_context);
 		goto free_and_exit;
 	}
 	journal = transaction->t_journal;
 	tid = transaction->t_tid;
 
-	J_ASSERT(journal_current_handle() == handle);
-	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
-
 	if (is_handle_aborted(handle))
 		err = -EIO;
 
@@ -1797,9 +1802,6 @@ int jbd2_journal_stop(handle_t *handle)
 
 	if (handle->h_sync)
 		transaction->t_synchronous_commit = 1;
-	current->journal_info = NULL;
-	atomic_sub(handle->h_buffer_credits,
-		   &transaction->t_outstanding_credits);
 
 	/*
 	 * If the handle is marked SYNC, we need to set another commit
@@ -1826,27 +1828,19 @@ int jbd2_journal_stop(handle_t *handle)
 	}
 
 	/*
-	 * Once we drop t_updates, if it goes to zero the transaction
-	 * could start committing on us and eventually disappear.  So
-	 * once we do this, we must not dereference transaction
-	 * pointer again.
+	 * Once stop_this_handle() drops t_updates, the transaction could start
+	 * committing on us and eventually disappear.  So we must not
+	 * dereference transaction pointer again after calling
+	 * stop_this_handle().
 	 */
-	if (atomic_dec_and_test(&transaction->t_updates))
-		wake_up(&journal->j_wait_updates);
-
-	rwsem_release(&journal->j_trans_commit_map, 1, _THIS_IP_);
+	stop_this_handle(handle);
 
 	if (wait_for_commit)
 		err = jbd2_log_wait_commit(journal, tid);
 
-	if (handle->h_rsv_handle)
-		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
-	/*
-	 * Scope of the GFP_NOFS context is over here and so we can restore the
-	 * original alloc context.
-	 */
-	memalloc_nofs_restore(handle->saved_alloc_context);
+	if (handle->h_rsv_handle)
+		jbd2_free_handle(handle->h_rsv_handle);
 	jbd2_free_handle(handle);
 	return err;
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 18/25] jbd2: Account descriptor blocks into t_outstanding_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (41 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 17/25] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 19/25] jbd2: Drop jbd2_space_needed() Jan Kara
                   ` (8 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently, journal descriptor blocks are not accounted in
transaction->t_outstanding_credits and we just leave some slack
space in the journal for them (in jbd2_log_space_left() and
jbd2_space_needed()). This makes proper accounting (and the reservation
we want to add) of descriptor blocks difficult, so switch to accounting
descriptor blocks in transaction->t_outstanding_credits and just reserve
the same amount of credits in t_outstanding_credits for journal
descriptor blocks when creating a transaction.
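
As a rough illustration of the arithmetic (a userspace sketch, not part
of the patch), the reservation and the simplified free-space check can
be modelled as below. The function names and the example numbers are
made up for illustration; only the shift of 5 and the fixed slack of 32
blocks are taken from the code in this patch.

```c
#include <assert.h>

/* Same value as JBD2_CONTROL_BLOCKS_SHIFT in the patch: 1/32 of the
 * maximum transaction size is set aside for descriptor blocks. */
#define CONTROL_BLOCKS_SHIFT 5

/*
 * Hypothetical model of the initial t_outstanding_credits value that
 * jbd2_get_transaction() sets after this patch: the descriptor block
 * reservation plus whatever is already reserved in the journal.
 */
static int initial_outstanding_credits(int max_transaction_buffers,
				       int reserved_credits)
{
	assert(max_transaction_buffers >= 0 && reserved_credits >= 0);
	return (max_transaction_buffers >> CONTROL_BLOCKS_SHIFT) +
	       reserved_credits;
}

/*
 * Hypothetical model of jbd2_log_space_left() after this patch. The
 * committing transaction's credits already include its descriptor
 * blocks, so no extra ">> 5" term is added on top. Pass 0 for
 * committing_credits when no transaction is committing.
 */
static long log_space_left(long j_free, int committing_credits)
{
	long free = j_free - 32;

	free -= committing_credits;
	return free > 0 ? free : 0;
}
```

With a maximum transaction size of 8192 buffers and no reserved
credits, a new transaction in this model starts with 256 credits
charged for descriptor blocks.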

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c      |  6 ++++--
 fs/jbd2/journal.c     |  1 +
 fs/jbd2/transaction.c | 20 ++++++++++++--------
 include/linux/jbd2.h  | 22 +++++++---------------
 4 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b67e2d0cff88..9047f8e269d0 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -560,8 +560,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	stats.run.rs_logging = jiffies;
 	stats.run.rs_flushing = jbd2_time_diff(stats.run.rs_flushing,
 					       stats.run.rs_logging);
-	stats.run.rs_blocks =
-		atomic_read(&commit_transaction->t_outstanding_credits);
+	stats.run.rs_blocks = commit_transaction->t_nr_buffers;
 	stats.run.rs_blocks_logged = 0;
 
 	J_ASSERT(commit_transaction->t_nr_buffers <=
@@ -889,6 +888,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	if (err)
 		jbd2_journal_abort(journal, err);
 
+	WARN_ON_ONCE(
+		atomic_read(&commit_transaction->t_outstanding_credits) < 0);
+
 	/*
 	 * Now disk caches for filesystem device are flushed so we are safe to
 	 * erase checkpointed transactions from the log by updating journal
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index cc11097f1176..22b14b3ca197 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -840,6 +840,7 @@ jbd2_journal_get_descriptor_buffer(transaction_t *transaction, int type)
 	bh = __getblk(journal->j_dev, blocknr, journal->j_blocksize);
 	if (!bh)
 		return NULL;
+	atomic_dec(&transaction->t_outstanding_credits);
 	lock_buffer(bh);
 	memset(bh->b_data, 0, journal->j_blocksize);
 	header = (journal_header_t *)bh->b_data;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index b30df011beaa..ed7cf9e62584 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -62,6 +62,17 @@ void jbd2_journal_free_transaction(transaction_t *transaction)
 	kmem_cache_free(transaction_cache, transaction);
 }
 
+/*
+ * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
+ * transaction descriptor blocks.
+ */
+#define JBD2_CONTROL_BLOCKS_SHIFT 5
+
+static int jbd2_descriptor_blocks_per_trans(journal_t *journal)
+{
+	return journal->j_max_transaction_buffers >> JBD2_CONTROL_BLOCKS_SHIFT;
+}
+
 /*
  * jbd2_get_transaction: obtain a new transaction_t object.
  *
@@ -88,6 +99,7 @@ static void jbd2_get_transaction(journal_t *journal,
 	spin_lock_init(&transaction->t_handle_lock);
 	atomic_set(&transaction->t_updates, 0);
 	atomic_set(&transaction->t_outstanding_credits,
+		   jbd2_descriptor_blocks_per_trans(journal) +
 		   atomic_read(&journal->j_reserved_credits));
 	atomic_set(&transaction->t_handle_count, 0);
 	INIT_LIST_HEAD(&transaction->t_inode_list);
@@ -634,14 +646,6 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 		goto unlock;
 	}
 
-	if (wanted + (wanted >> JBD2_CONTROL_BLOCKS_SHIFT) >
-	    jbd2_log_space_left(journal)) {
-		jbd_debug(3, "denied handle %p %d blocks: "
-			  "insufficient log space\n", handle, nblocks);
-		atomic_sub(nblocks, &transaction->t_outstanding_credits);
-		goto unlock;
-	}
-
 	trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev,
 				 transaction->t_tid,
 				 handle->h_type, handle->h_line_no,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 727ff91d7f3e..bef4f74b1ea0 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -681,8 +681,10 @@ struct transaction_s
 	atomic_t		t_updates;
 
 	/*
-	 * Number of buffers reserved for use by all handles in this transaction
-	 * handle but not yet modified. [none]
+	 * Number of blocks reserved for this transaction in the journal.
+	 * This is including all credits reserved when starting transaction
+	 * handles as well as all journal descriptor blocks needed for this
+	 * transaction. [none]
 	 */
 	atomic_t		t_outstanding_credits;
 
@@ -1560,20 +1562,13 @@ static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
 	return journal->j_chksum_driver != NULL;
 }
 
-/*
- * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
- * transaction control blocks.
- */
-#define JBD2_CONTROL_BLOCKS_SHIFT 5
-
 /*
  * Return the minimum number of blocks which must be free in the journal
  * before a new transaction may be started.  Must be called under j_state_lock.
  */
 static inline int jbd2_space_needed(journal_t *journal)
 {
-	int nblocks = journal->j_max_transaction_buffers;
-	return nblocks + (nblocks >> JBD2_CONTROL_BLOCKS_SHIFT);
+	return journal->j_max_transaction_buffers;
 }
 
 /*
@@ -1585,11 +1580,8 @@ static inline unsigned long jbd2_log_space_left(journal_t *journal)
 	long free = journal->j_free - 32;
 
 	if (journal->j_committing_transaction) {
-		unsigned long committing = atomic_read(&journal->
-			j_committing_transaction->t_outstanding_credits);
-
-		/* Transaction + control blocks */
-		free -= committing + (committing >> JBD2_CONTROL_BLOCKS_SHIFT);
+		free -= atomic_read(&journal->
+                        j_committing_transaction->t_outstanding_credits);
 	}
 	return max_t(long, free, 0);
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 19/25] jbd2: Drop jbd2_space_needed()
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (42 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 18/25] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks Jan Kara
                   ` (7 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The function is now just a trivial wrapper returning
journal->j_max_transaction_buffers. Drop it.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/checkpoint.c  | 2 +-
 fs/jbd2/transaction.c | 5 +++--
 include/linux/jbd2.h  | 9 ---------
 3 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
index a1909066bde6..8fff6677a5da 100644
--- a/fs/jbd2/checkpoint.c
+++ b/fs/jbd2/checkpoint.c
@@ -110,7 +110,7 @@ void __jbd2_log_wait_for_space(journal_t *journal)
 	int nblocks, space_left;
 	/* assert_spin_locked(&journal->j_state_lock); */
 
-	nblocks = jbd2_space_needed(journal);
+	nblocks = journal->j_max_transaction_buffers;
 	while (jbd2_log_space_left(journal) < nblocks) {
 		write_unlock(&journal->j_state_lock);
 		mutex_lock_io(&journal->j_checkpoint_mutex);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ed7cf9e62584..ba388da7e02b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -270,12 +270,13 @@ static int add_transaction_credits(journal_t *journal, int blocks,
 	 * *before* starting to dirty potentially checkpointed buffers
 	 * in the new transaction.
 	 */
-	if (jbd2_log_space_left(journal) < jbd2_space_needed(journal)) {
+	if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers) {
 		atomic_sub(total, &t->t_outstanding_credits);
 		read_unlock(&journal->j_state_lock);
 		jbd2_might_wait_for_commit(journal);
 		write_lock(&journal->j_state_lock);
-		if (jbd2_log_space_left(journal) < jbd2_space_needed(journal))
+		if (jbd2_log_space_left(journal) <
+					journal->j_max_transaction_buffers)
 			__jbd2_log_wait_for_space(journal);
 		write_unlock(&journal->j_state_lock);
 		return 1;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index bef4f74b1ea0..1dd2703a8e26 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1562,15 +1562,6 @@ static inline int jbd2_journal_has_csum_v2or3(journal_t *journal)
 	return journal->j_chksum_driver != NULL;
 }
 
-/*
- * Return the minimum number of blocks which must be free in the journal
- * before a new transaction may be started.  Must be called under j_state_lock.
- */
-static inline int jbd2_space_needed(journal_t *journal)
-{
-	return journal->j_max_transaction_buffers;
-}
-
 /*
  * Return number of free blocks in the log. Must be called under j_state_lock.
  */
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (43 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 19/25] jbd2: Drop jbd2_space_needed() Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-15  7:52   ` Eric Biggers
  2019-11-05 16:44 ` [PATCH 21/25] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
                   ` (6 subsequent siblings)
  51 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Extend the functions for starting, extending, and restarting transaction
handles to take the number of revoke records the handle must be able to
accommodate. These functions then make sure the transaction has enough
credits to store the resulting revoke descriptor blocks. The revoke code
also tracks the number of revoke records created by a handle to catch
situations where some place didn't reserve enough space for revoke
records. Similarly to standard transaction credits, space for unused
reserved revoke records is released when the handle is stopped.
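
The release at handle stop can be sketched in userspace roughly as
follows. This is a simplified model, not the kernel code: the struct and
function names are hypothetical, atomics and locking are left out, and
the figure of 509 records per block used in the note below is only an
example value.

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical stand-ins for the transaction and handle fields involved. */
struct txn_model {
	int outstanding_credits;	/* models t_outstanding_credits */
	int outstanding_revokes;	/* models t_outstanding_revokes */
};

struct handle_model {
	int credits;			/* unused credits left in the handle */
	int revoke_credits;		/* revoke records still unused */
	int revoke_credits_requested;	/* revoke records asked for at start */
};

/*
 * Model of the accounting done when a handle stops: the handle keeps
 * paying only for the revoke descriptor blocks that its revoke records
 * add to the transaction's running total, and returns the rest.
 */
static void model_stop_handle(struct txn_model *t, struct handle_model *h,
			      int rr_per_blk)
{
	int revokes = h->revoke_credits_requested - h->revoke_credits;

	assert(rr_per_blk > 0);
	if (revokes) {
		int t_revokes = (t->outstanding_revokes += revokes);
		int descriptors = DIV_ROUND_UP(t_revokes, rr_per_blk) -
				  DIV_ROUND_UP(t_revokes - revokes, rr_per_blk);

		/* These blocks stay charged to the transaction. */
		h->credits -= descriptors;
	}
	/* Everything else the handle did not use is given back. */
	t->outstanding_credits -= h->credits;
}
```

In this model, with 509 records per block, two handles that each revoke
10 blocks end up charging the transaction one descriptor block in total:
the first handle crosses from 0 to 1 needed blocks, the second stays
within the same block and returns all of its revoke reservation.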

On the ext4 side we currently take a simplistic approach of reserving
space for 1024 revoke records for any transaction. This grows the amount
of credits reserved for each handle only by a few and is enough for any
normal workload so that we don't hit warnings in jbd2. We will refine
the logic in following commits.
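
To put a number on "only by a few", here is a small userspace sketch of
the same arithmetic. It is not kernel code and the helper names are
hypothetical. It assumes a 16-byte revoke block header and a 4-byte
checksum tail; in the kernel these come from
sizeof(jbd2_journal_revoke_header_t) and
sizeof(struct jbd2_journal_block_tail).

```c
#include <assert.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Assumed on-disk sizes, see the note above. */
#define REVOKE_HEADER_SIZE 16
#define BLOCK_TAIL_SIZE 4

/* Model of journal_revoke_records_per_block() from this patch. */
static int revoke_records_per_block(int blocksize, int has_64bit,
				    int has_csum)
{
	int space = blocksize - REVOKE_HEADER_SIZE;
	int record_size = has_64bit ? 8 : 4;

	assert(blocksize > REVOKE_HEADER_SIZE + BLOCK_TAIL_SIZE);
	if (has_csum)
		space -= BLOCK_TAIL_SIZE;
	return space / record_size;
}

/* Extra credits a handle needs up front for its revoke records. */
static int revoke_descriptor_blocks(int revoke_records, int rr_per_blk)
{
	assert(rr_per_blk > 0);
	return DIV_ROUND_UP(revoke_records, rr_per_blk);
}

/*
 * Extra credits needed when extending a handle that already asked for
 * 'requested' revoke records by 'extra' more, as computed in
 * jbd2_journal_extend().
 */
static int extend_extra_blocks(int requested, int extra, int rr_per_blk)
{
	assert(rr_per_blk > 0);
	return DIV_ROUND_UP(requested + extra, rr_per_blk) -
	       DIV_ROUND_UP(requested, rr_per_blk);
}
```

Under these assumptions a 4k-block journal with the 64-bit and checksum
features fits 509 revoke records per block, so 1024 records cost 3
extra credits per handle. For a 1k-block journal with the same features
it is 125 records per block and 9 extra credits.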

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4_jbd2.c   |  2 +-
 fs/ext4/ext4_jbd2.h   |  4 ++--
 fs/jbd2/journal.c     | 21 ++++++++++++++++++++
 fs/jbd2/revoke.c      |  6 ++++++
 fs/jbd2/transaction.c | 54 ++++++++++++++++++++++++++++++++++++++++++++-------
 fs/ocfs2/journal.c    |  4 ++--
 include/linux/jbd2.h  | 43 ++++++++++++++++++++++++++++++----------
 7 files changed, 112 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 731bbfdbce5b..b81190bee32d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -78,7 +78,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	journal = EXT4_SB(sb)->s_journal;
 	if (!journal)
 		return ext4_get_nojournal();
-	return jbd2__journal_start(journal, blocks, rsv_blocks, GFP_NOFS,
+	return jbd2__journal_start(journal, blocks, rsv_blocks, 1024, GFP_NOFS,
 				   type, line);
 }
 
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 36aa72599646..aca05e52e317 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -328,14 +328,14 @@ static inline handle_t *ext4_journal_current_handle(void)
 static inline int ext4_journal_extend(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_extend(handle, nblocks);
+		return jbd2_journal_extend(handle, nblocks, 1024);
 	return 0;
 }
 
 static inline int ext4_journal_restart(handle_t *handle, int nblocks)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_restart(handle, nblocks);
+		return jbd2__journal_restart(handle, nblocks, 1024, GFP_NOFS);
 	return 0;
 }
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 22b14b3ca197..eef809f61722 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1500,6 +1500,21 @@ void jbd2_journal_update_sb_errno(journal_t *journal)
 }
 EXPORT_SYMBOL(jbd2_journal_update_sb_errno);
 
+static int journal_revoke_records_per_block(journal_t *journal)
+{
+	int record_size;
+	int space = journal->j_blocksize - sizeof(jbd2_journal_revoke_header_t);
+
+	if (jbd2_has_feature_64bit(journal))
+		record_size = 8;
+	else
+		record_size = 4;
+
+	if (jbd2_journal_has_csum_v2or3(journal))
+		space -= sizeof(struct jbd2_journal_block_tail);
+	return space / record_size;
+}
+
 /*
  * Read the superblock for a given journal, performing initial
  * validation of the format.
@@ -1608,6 +1623,8 @@ static int journal_get_superblock(journal_t *journal)
 						   sizeof(sb->s_uuid));
 	}
 
+	journal->j_revoke_records_per_block =
+				journal_revoke_records_per_block(journal);
 	set_buffer_verified(bh);
 
 	return 0;
@@ -1928,6 +1945,8 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
 	sb->s_feature_ro_compat |= cpu_to_be32(ro);
 	sb->s_feature_incompat  |= cpu_to_be32(incompat);
 	unlock_buffer(journal->j_sb_buffer);
+	journal->j_revoke_records_per_block =
+				journal_revoke_records_per_block(journal);
 
 	return 1;
 #undef COMPAT_FEATURE_ON
@@ -1958,6 +1977,8 @@ void jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
 	sb->s_feature_compat    &= ~cpu_to_be32(compat);
 	sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
 	sb->s_feature_incompat  &= ~cpu_to_be32(incompat);
+	journal->j_revoke_records_per_block =
+				journal_revoke_records_per_block(journal);
 }
 EXPORT_SYMBOL(jbd2_journal_clear_features);
 
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index f08073d7bbf5..fa608788b93d 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -371,6 +371,11 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 	}
 #endif
 
+	if (WARN_ON_ONCE(handle->h_revoke_credits <= 0)) {
+		if (!bh_in)
+			brelse(bh);
+		return -EIO;
+	}
 	/* We really ought not ever to revoke twice in a row without
            first having the revoke cancelled: it's illegal to free a
            block twice without allocating it in between! */
@@ -391,6 +396,7 @@ int jbd2_journal_revoke(handle_t *handle, unsigned long long blocknr,
 			__brelse(bh);
 		}
 	}
+	handle->h_revoke_credits--;
 
 	jbd_debug(2, "insert revoke for block %llu, bh_in=%p\n",blocknr, bh_in);
 	err = insert_revoke_hash(journal, blocknr,
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ba388da7e02b..1c121afbcf8f 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -101,6 +101,7 @@ static void jbd2_get_transaction(journal_t *journal,
 	atomic_set(&transaction->t_outstanding_credits,
 		   jbd2_descriptor_blocks_per_trans(journal) +
 		   atomic_read(&journal->j_reserved_credits));
+	atomic_set(&transaction->t_outstanding_revokes, 0);
 	atomic_set(&transaction->t_handle_count, 0);
 	INIT_LIST_HEAD(&transaction->t_inode_list);
 	INIT_LIST_HEAD(&transaction->t_private_list);
@@ -418,6 +419,7 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 	update_t_max_wait(transaction, ts);
 	handle->h_transaction = transaction;
 	handle->h_requested_credits = blocks;
+	handle->h_revoke_credits_requested = handle->h_revoke_credits;
 	handle->h_start_jiffies = jiffies;
 	atomic_inc(&transaction->t_updates);
 	atomic_inc(&transaction->t_handle_count);
@@ -451,8 +453,8 @@ static handle_t *new_handle(int nblocks)
 }
 
 handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
-			      gfp_t gfp_mask, unsigned int type,
-			      unsigned int line_no)
+			      int revoke_records, gfp_t gfp_mask,
+			      unsigned int type, unsigned int line_no)
 {
 	handle_t *handle = journal_current_handle();
 	int err;
@@ -466,6 +468,8 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 		return handle;
 	}
 
+	nblocks += DIV_ROUND_UP(revoke_records,
+				journal->j_revoke_records_per_block);
 	handle = new_handle(nblocks);
 	if (!handle)
 		return ERR_PTR(-ENOMEM);
@@ -481,6 +485,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 		rsv_handle->h_journal = journal;
 		handle->h_rsv_handle = rsv_handle;
 	}
+	handle->h_revoke_credits = revoke_records;
 
 	err = start_this_handle(journal, handle, gfp_mask);
 	if (err < 0) {
@@ -521,7 +526,7 @@ EXPORT_SYMBOL(jbd2__journal_start);
  */
 handle_t *jbd2_journal_start(journal_t *journal, int nblocks)
 {
-	return jbd2__journal_start(journal, nblocks, 0, GFP_NOFS, 0, 0);
+	return jbd2__journal_start(journal, nblocks, 0, 0, GFP_NOFS, 0, 0);
 }
 EXPORT_SYMBOL(jbd2_journal_start);
 
@@ -598,6 +603,7 @@ EXPORT_SYMBOL(jbd2_journal_start_reserved);
  * int jbd2_journal_extend() - extend buffer credits.
  * @handle:  handle to 'extend'
  * @nblocks: nr blocks to try to extend by.
+ * @revoke_records: number of revoke records to try to extend by.
  *
  * Some transactions, such as large extends and truncates, can be done
  * atomically all at once or in several stages.  The operation requests
@@ -614,7 +620,7 @@ EXPORT_SYMBOL(jbd2_journal_start_reserved);
  * return code < 0 implies an error
  * return code > 0 implies normal transaction-full status.
  */
-int jbd2_journal_extend(handle_t *handle, int nblocks)
+int jbd2_journal_extend(handle_t *handle, int nblocks, int revoke_records)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -636,6 +642,12 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 		goto error_out;
 	}
 
+	nblocks += DIV_ROUND_UP(
+			handle->h_revoke_credits_requested + revoke_records,
+			journal->j_revoke_records_per_block) -
+		DIV_ROUND_UP(
+			handle->h_revoke_credits_requested,
+			journal->j_revoke_records_per_block);
 	spin_lock(&transaction->t_handle_lock);
 	wanted = atomic_add_return(nblocks,
 				   &transaction->t_outstanding_credits);
@@ -655,6 +667,8 @@ int jbd2_journal_extend(handle_t *handle, int nblocks)
 
 	handle->h_buffer_credits += nblocks;
 	handle->h_requested_credits += nblocks;
+	handle->h_revoke_credits += revoke_records;
+	handle->h_revoke_credits_requested += revoke_records;
 	result = 0;
 
 	jbd_debug(3, "extended handle %p by %d\n", handle, nblocks);
@@ -669,10 +683,31 @@ static void stop_this_handle(handle_t *handle)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal = transaction->t_journal;
+	int revokes;
 
 	J_ASSERT(journal_current_handle() == handle);
 	J_ASSERT(atomic_read(&transaction->t_updates) > 0);
 	current->journal_info = NULL;
+	/*
+	 * Subtract necessary revoke descriptor blocks from handle credits. We
+	 * take care to account only for revoke descriptor blocks the
+	 * transaction will really need as large sequences of transactions with
+	 * small numbers of revokes are relatively common.
+	 */
+	revokes = handle->h_revoke_credits_requested - handle->h_revoke_credits;
+	if (revokes) {
+		int t_revokes, revoke_descriptors;
+		int rr_per_blk = journal->j_revoke_records_per_block;
+
+		WARN_ON_ONCE(DIV_ROUND_UP(revokes, rr_per_blk)
+				> handle->h_buffer_credits);
+		t_revokes = atomic_add_return(revokes,
+				&transaction->t_outstanding_revokes);
+		revoke_descriptors =
+			DIV_ROUND_UP(t_revokes, rr_per_blk) -
+			DIV_ROUND_UP(t_revokes - revokes, rr_per_blk);
+		handle->h_buffer_credits -= revoke_descriptors;
+	}
 	atomic_sub(handle->h_buffer_credits,
 		   &transaction->t_outstanding_credits);
 	if (handle->h_rsv_handle)
@@ -692,6 +727,7 @@ static void stop_this_handle(handle_t *handle)
  * int jbd2_journal_restart() - restart a handle .
  * @handle:  handle to restart
  * @nblocks: nr credits requested
+ * @revoke_records: number of revoke record credits requested
  * @gfp_mask: memory allocation flags (for start_this_handle)
  *
  * Restart a handle for a multi-transaction filesystem
@@ -704,7 +740,8 @@ static void stop_this_handle(handle_t *handle)
  * credits. We preserve reserved handle if there's any attached to the
  * passed in handle.
  */
-int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
+int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
+			  gfp_t gfp_mask)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -735,7 +772,10 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, gfp_t gfp_mask)
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-	handle->h_buffer_credits = nblocks;
+	handle->h_buffer_credits = nblocks +
+		DIV_ROUND_UP(revoke_records,
+			     journal->j_revoke_records_per_block);
+	handle->h_revoke_credits = revoke_records;
 	return start_this_handle(journal, handle, gfp_mask);
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
@@ -743,7 +783,7 @@ EXPORT_SYMBOL(jbd2__journal_restart);
 
 int jbd2_journal_restart(handle_t *handle, int nblocks)
 {
-	return jbd2__journal_restart(handle, nblocks, GFP_NOFS);
+	return jbd2__journal_restart(handle, nblocks, 0, GFP_NOFS);
 }
 EXPORT_SYMBOL(jbd2_journal_restart);
 
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 019aaf2a3f8a..a032f0297dad 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -426,7 +426,7 @@ int ocfs2_extend_trans(handle_t *handle, int nblocks)
 #ifdef CONFIG_OCFS2_DEBUG_FS
 	status = 1;
 #else
-	status = jbd2_journal_extend(handle, nblocks);
+	status = jbd2_journal_extend(handle, nblocks, 0);
 	if (status < 0) {
 		mlog_errno(status);
 		goto bail;
@@ -466,7 +466,7 @@ int ocfs2_allocate_extend_trans(handle_t *handle, int thresh)
 	if (old_nblks < thresh)
 		return 0;
 
-	status = jbd2_journal_extend(handle, OCFS2_MAX_TRANS_DATA);
+	status = jbd2_journal_extend(handle, OCFS2_MAX_TRANS_DATA, 0);
 	if (status < 0) {
 		mlog_errno(status);
 		goto bail;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 1dd2703a8e26..2a3d5f50e7a1 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -478,6 +478,7 @@ struct jbd2_revoke_table_s;
  * @h_journal: Which journal handle belongs to - used iff h_reserved set.
  * @h_rsv_handle: Handle reserved for finishing the logical operation.
  * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
+ * @h_revoke_credits: Number of remaining revoke records available for handle
  * @h_ref: Reference count on this handle.
  * @h_err: Field for caller's use to track errors through large fs operations.
  * @h_sync: Flag for sync-on-close.
@@ -488,6 +489,7 @@ struct jbd2_revoke_table_s;
  * @h_line_no: For handle statistics.
  * @h_start_jiffies: Handle Start time.
  * @h_requested_credits: Holds @h_buffer_credits after handle is started.
+ * @h_revoke_credits_requested: Holds @h_revoke_credits after handle is started.
  * @saved_alloc_context: Saved context while transaction is open.
  **/
 
@@ -505,6 +507,8 @@ struct jbd2_journal_handle
 
 	handle_t		*h_rsv_handle;
 	int			h_buffer_credits;
+	int			h_revoke_credits;
+	int			h_revoke_credits_requested;
 	int			h_ref;
 	int			h_err;
 
@@ -688,6 +692,17 @@ struct transaction_s
 	 */
 	atomic_t		t_outstanding_credits;
 
+	/*
+	 * Number of revoke records for this transaction added by already
+	 * stopped handles. [none]
+	 */
+	atomic_t		t_outstanding_revokes;
+
+	/*
+	 * How many handles used this transaction? [none]
+	 */
+	atomic_t		t_handle_count;
+
 	/*
 	 * Forward and backward links for the circular list of all transactions
 	 * awaiting checkpoint. [j_list_lock]
@@ -705,11 +720,6 @@ struct transaction_s
 	 */
 	ktime_t			t_start_time;
 
-	/*
-	 * How many handles used this transaction? [none]
-	 */
-	atomic_t		t_handle_count;
-
 	/*
 	 * This transaction is being forced and some process is
 	 * waiting for it to finish.
@@ -1026,6 +1036,13 @@ struct journal_s
 	 */
 	int			j_max_transaction_buffers;
 
+	/**
+	 * @j_revoke_records_per_block:
+	 *
+	 * Number of revoke records that fit in one descriptor block.
+	 */
+	int			j_revoke_records_per_block;
+
 	/**
 	 * @j_commit_interval:
 	 *
@@ -1360,14 +1377,16 @@ static inline handle_t *journal_current_handle(void)
 
 extern handle_t *jbd2_journal_start(journal_t *, int nblocks);
 extern handle_t *jbd2__journal_start(journal_t *, int blocks, int rsv_blocks,
-				     gfp_t gfp_mask, unsigned int type,
-				     unsigned int line_no);
+				     int revoke_records, gfp_t gfp_mask,
+				     unsigned int type, unsigned int line_no);
 extern int	 jbd2_journal_restart(handle_t *, int nblocks);
-extern int	 jbd2__journal_restart(handle_t *, int nblocks, gfp_t gfp_mask);
+extern int	 jbd2__journal_restart(handle_t *, int nblocks,
+				       int revoke_records, gfp_t gfp_mask);
 extern int	 jbd2_journal_start_reserved(handle_t *handle,
 				unsigned int type, unsigned int line_no);
 extern void	 jbd2_journal_free_reserved(handle_t *handle);
-extern int	 jbd2_journal_extend (handle_t *, int nblocks);
+extern int	 jbd2_journal_extend(handle_t *handle, int nblocks,
+				     int revoke_records);
 extern int	 jbd2_journal_get_write_access(handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_create_access (handle_t *, struct buffer_head *);
 extern int	 jbd2_journal_get_undo_access(handle_t *, struct buffer_head *);
@@ -1631,7 +1650,11 @@ static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
 
 static inline int jbd2_handle_buffer_credits(handle_t *handle)
 {
-	return handle->h_buffer_credits;
+	journal_t *journal = handle->h_transaction->t_journal;
+
+	return handle->h_buffer_credits -
+		DIV_ROUND_UP(handle->h_revoke_credits_requested,
+			     journal->j_revoke_records_per_block);
 }
 
 #ifdef __KERNEL__
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 21/25] jbd2: Rename h_buffer_credits to h_total_credits
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (44 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 22/25] jbd2: Make credit checking more strict Jan Kara
                   ` (5 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

The credit counter now contains both buffer and revoke descriptor block
credits. Rename the counter to h_total_credits to reflect that. No
functional change.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 30 +++++++++++++++---------------
 include/linux/jbd2.h  |  9 +++++----
 2 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 1c121afbcf8f..10fd802fd222 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -313,12 +313,12 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 			     gfp_t gfp_mask)
 {
 	transaction_t	*transaction, *new_transaction = NULL;
-	int		blocks = handle->h_buffer_credits;
+	int		blocks = handle->h_total_credits;
 	int		rsv_blocks = 0;
 	unsigned long ts = jiffies;
 
 	if (handle->h_rsv_handle)
-		rsv_blocks = handle->h_rsv_handle->h_buffer_credits;
+		rsv_blocks = handle->h_rsv_handle->h_total_credits;
 
 	/*
 	 * Limit the number of reserved credits to 1/2 of maximum transaction
@@ -446,7 +446,7 @@ static handle_t *new_handle(int nblocks)
 	handle_t *handle = jbd2_alloc_handle(GFP_NOFS);
 	if (!handle)
 		return NULL;
-	handle->h_buffer_credits = nblocks;
+	handle->h_total_credits = nblocks;
 	handle->h_ref = 1;
 
 	return handle;
@@ -535,7 +535,7 @@ static void __jbd2_journal_unreserve_handle(handle_t *handle)
 	journal_t *journal = handle->h_journal;
 
 	WARN_ON(!handle->h_reserved);
-	sub_reserved_credits(journal, handle->h_buffer_credits);
+	sub_reserved_credits(journal, handle->h_total_credits);
 }
 
 void jbd2_journal_free_reserved(handle_t *handle)
@@ -594,7 +594,7 @@ int jbd2_journal_start_reserved(handle_t *handle, unsigned int type,
 	handle->h_line_no = line_no;
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
-				line_no, handle->h_buffer_credits);
+				line_no, handle->h_total_credits);
 	return 0;
 }
 EXPORT_SYMBOL(jbd2_journal_start_reserved);
@@ -662,10 +662,10 @@ int jbd2_journal_extend(handle_t *handle, int nblocks, int revoke_records)
 	trace_jbd2_handle_extend(journal->j_fs_dev->bd_dev,
 				 transaction->t_tid,
 				 handle->h_type, handle->h_line_no,
-				 handle->h_buffer_credits,
+				 handle->h_total_credits,
 				 nblocks);
 
-	handle->h_buffer_credits += nblocks;
+	handle->h_total_credits += nblocks;
 	handle->h_requested_credits += nblocks;
 	handle->h_revoke_credits += revoke_records;
 	handle->h_revoke_credits_requested += revoke_records;
@@ -700,15 +700,15 @@ static void stop_this_handle(handle_t *handle)
 		int rr_per_blk = journal->j_revoke_records_per_block;
 
 		WARN_ON_ONCE(DIV_ROUND_UP(revokes, rr_per_blk)
-				> handle->h_buffer_credits);
+				> handle->h_total_credits);
 		t_revokes = atomic_add_return(revokes,
 				&transaction->t_outstanding_revokes);
 		revoke_descriptors =
 			DIV_ROUND_UP(t_revokes, rr_per_blk) -
 			DIV_ROUND_UP(t_revokes - revokes, rr_per_blk);
-		handle->h_buffer_credits -= revoke_descriptors;
+		handle->h_total_credits -= revoke_descriptors;
 	}
-	atomic_sub(handle->h_buffer_credits,
+	atomic_sub(handle->h_total_credits,
 		   &transaction->t_outstanding_credits);
 	if (handle->h_rsv_handle)
 		__jbd2_journal_unreserve_handle(handle->h_rsv_handle);
@@ -772,7 +772,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 	read_unlock(&journal->j_state_lock);
 	if (need_to_start)
 		jbd2_log_start_commit(journal, tid);
-	handle->h_buffer_credits = nblocks +
+	handle->h_total_credits = nblocks +
 		DIV_ROUND_UP(revoke_records,
 			     journal->j_revoke_records_per_block);
 	handle->h_revoke_credits = revoke_records;
@@ -1477,12 +1477,12 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 		 * of the transaction. This needs to be done
 		 * once a transaction -bzzz
 		 */
-		if (handle->h_buffer_credits <= 0) {
+		if (handle->h_total_credits <= 0) {
 			ret = -ENOSPC;
 			goto out_unlock_bh;
 		}
 		jh->b_modified = 1;
-		handle->h_buffer_credits--;
+		handle->h_total_credits--;
 	}
 
 	/*
@@ -1726,7 +1726,7 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 drop:
 	if (drop_reserve) {
 		/* no need to reserve log space for this block -bzzz */
-		handle->h_buffer_credits++;
+		handle->h_total_credits++;
 	}
 	return err;
 
@@ -1787,7 +1787,7 @@ int jbd2_journal_stop(handle_t *handle)
 				jiffies - handle->h_start_jiffies,
 				handle->h_sync, handle->h_requested_credits,
 				(handle->h_requested_credits -
-				 handle->h_buffer_credits));
+				 handle->h_total_credits));
 
 	/*
 	 * Implement synchronous transaction batching.  If the handle
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2a3d5f50e7a1..3115eeb44039 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -477,7 +477,8 @@ struct jbd2_revoke_table_s;
  * @h_transaction: Which compound transaction is this update a part of?
  * @h_journal: Which journal handle belongs to - used iff h_reserved set.
  * @h_rsv_handle: Handle reserved for finishing the logical operation.
- * @h_buffer_credits: Number of remaining buffers we are allowed to dirty.
+ * @h_total_credits: Number of remaining buffers we are allowed to add to
+	journal. These are dirty buffers and revoke descriptor blocks.
  * @h_revoke_credits: Number of remaining revoke records available for handle
  * @h_ref: Reference count on this handle.
  * @h_err: Field for caller's use to track errors through large fs operations.
@@ -488,7 +489,7 @@ struct jbd2_revoke_table_s;
  * @h_type: For handle statistics.
  * @h_line_no: For handle statistics.
  * @h_start_jiffies: Handle Start time.
- * @h_requested_credits: Holds @h_buffer_credits after handle is started.
+ * @h_requested_credits: Holds @h_total_credits after handle is started.
  * @h_revoke_credits_requested: Holds @h_revoke_credits after handle is started.
  * @saved_alloc_context: Saved context while transaction is open.
  **/
@@ -506,7 +507,7 @@ struct jbd2_journal_handle
 	};
 
 	handle_t		*h_rsv_handle;
-	int			h_buffer_credits;
+	int			h_total_credits;
 	int			h_revoke_credits;
 	int			h_revoke_credits_requested;
 	int			h_ref;
@@ -1652,7 +1653,7 @@ static inline int jbd2_handle_buffer_credits(handle_t *handle)
 {
 	journal_t *journal = handle->h_transaction->t_journal;
 
-	return handle->h_buffer_credits -
+	return handle->h_total_credits -
 		DIV_ROUND_UP(handle->h_revoke_credits_requested,
 			     journal->j_revoke_records_per_block);
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 22/25] jbd2: Make credit checking more strict
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (45 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 21/25] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 23/25] ext4: Reserve revoke credits for freed blocks Jan Kara
                   ` (4 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Make checking of available credits in jbd2_journal_dirty_metadata() more
strict. There should always be enough credits in the handle to write all
potential revoke descriptors. Also we warn in case there are not enough
credits since this is a bug in the filesystem.
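
The difference between the old and the new check can be sketched in user
space as follows. This is only an illustration under stated assumptions: the
struct and both helper names are hypothetical, the records-per-block value is
an arbitrary example, and the kernel additionally wraps the new condition in
WARN_ON_ONCE().

```c
#include <assert.h>
#include <stdbool.h>

/* Same rounding as the kernel's DIV_ROUND_UP() macro. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Hypothetical stand-in for the relevant handle_t fields. */
struct strict_demo_handle {
	int total_credits;            /* buffers + revoke descriptor blocks */
	int revoke_credits_requested; /* revoke records asked for */
};

/* Old check: any remaining credit allows dirtying one more buffer. */
static bool old_check_allows_dirty(const struct strict_demo_handle *h)
{
	return h->total_credits > 0;
}

/*
 * New check: credits needed for potential revoke descriptor blocks are
 * subtracted first, so they cannot be consumed by dirtied buffers.
 */
static bool strict_check_allows_dirty(const struct strict_demo_handle *h,
				      int rr_per_blk)
{
	int buffer_credits = h->total_credits -
		DIV_ROUND_UP(h->revoke_credits_requested, rr_per_blk);

	return buffer_credits > 0;
}
```

A handle with 2 total credits and 500 requested revoke records (assuming 250
records per block) passes the old check but fails the strict one, since both
remaining credits are needed for revoke descriptors.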

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 10fd802fd222..8f11b2d48ca0 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1477,7 +1477,7 @@ int jbd2_journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
 		 * of the transaction. This needs to be done
 		 * once a transaction -bzzz
 		 */
-		if (handle->h_total_credits <= 0) {
+		if (WARN_ON_ONCE(jbd2_handle_buffer_credits(handle) <= 0)) {
 			ret = -ENOSPC;
 			goto out_unlock_bh;
 		}
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 23/25] ext4: Reserve revoke credits for freed blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (46 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 22/25] jbd2: Make credit checking more strict Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 24/25] jbd2: Provide trace event for handle restarts Jan Kara
                   ` (3 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

So far we have reserved only a relatively high fixed number of revoke
credits for each transaction. We over-reserved by a large amount in most
cases, but when freeing large directories or files with data journalling,
the fixed amount is not enough. In fact the worst case estimate is
inconveniently large (maximum extent size) for freeing of one extent.

We fix this by properly estimating the number of blocks that need to be
revoked when removing blocks from the inode due to truncate or hole
punching, and otherwise reserve just a small number of revoke credits for
each transaction to accommodate freeing of an xattr block or so.
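
The estimates can be sketched in user space as below. This is a simplified
illustration, not the kernel helpers themselves: all demo_* names are
hypothetical, the cluster ratio is passed in directly instead of being read
from the superblock, and the mount-option and per-inode checks in
ext4_free_data_revoke_credits() are collapsed into a single assumed
needs_revoke flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Freeing each metadata block can result in freeing one whole cluster. */
static int demo_metadata_revoke_credits(int cluster_ratio, int blocks)
{
	return blocks * cluster_ratio;
}

/* Small default reserved for every transaction: 8 metadata blocks. */
static int demo_default_revoke_credits(int cluster_ratio)
{
	return demo_metadata_revoke_credits(cluster_ratio, 8);
}

/*
 * Data blocks of one extent are contiguous, so only the partial clusters
 * at both extent boundaries are added on top of the block count.
 */
static int demo_data_revoke_credits(bool needs_revoke, int cluster_ratio,
				    int blocks)
{
	if (!needs_revoke)
		return 0;
	return blocks + 2 * (cluster_ratio - 1);
}

/* Estimate for removing blocks a..b from a leaf of a tree of given depth. */
static int demo_rm_leaf_revoke_credits(bool needs_revoke, int cluster_ratio,
				       int depth, int a, int b)
{
	return demo_metadata_revoke_credits(cluster_ratio, depth) +
	       demo_data_revoke_credits(needs_revoke, cluster_ratio,
					b - a + 1);
}
```

For example, with an assumed cluster ratio of 16, punching 100 blocks from a
depth-2 tree needs 2 * 16 = 32 records for index blocks plus 100 + 2 * 15 =
130 records for data when the data blocks need revoking, 162 in total.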

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h              |  3 +-
 fs/ext4/ext4_jbd2.c         | 20 ++++++-----
 fs/ext4/ext4_jbd2.h         | 84 +++++++++++++++++++++++++++++++--------------
 fs/ext4/extents.c           | 27 +++++++++++----
 fs/ext4/ialloc.c            |  2 +-
 fs/ext4/indirect.c          | 12 ++++---
 fs/ext4/inode.c             |  2 +-
 fs/ext4/migrate.c           | 24 ++++++++-----
 fs/ext4/resize.c            | 16 ++++++---
 fs/ext4/xattr.c             |  4 ++-
 include/trace/events/ext4.h | 13 ++++---
 11 files changed, 140 insertions(+), 67 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 67a6fcc11182..a606d17a80b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3296,7 +3296,8 @@ extern int ext4_swap_extents(handle_t *handle, struct inode *inode1,
 			     int mark_unwritten,int *err);
 extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
-				       int check_cred, int restart_cred);
+				       int check_cred, int restart_cred,
+				       int revoke_cred);
 
 
 /* move_extent.c */
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index b81190bee32d..d3b8cdea5df7 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -65,12 +65,14 @@ static int ext4_journal_check_start(struct super_block *sb)
 }
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int blocks, int rsv_blocks)
+				  int type, int blocks, int rsv_blocks,
+				  int revoke_creds)
 {
 	journal_t *journal;
 	int err;
 
-	trace_ext4_journal_start(sb, blocks, rsv_blocks, _RET_IP_);
+	trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds,
+				 _RET_IP_);
 	err = ext4_journal_check_start(sb);
 	if (err < 0)
 		return ERR_PTR(err);
@@ -78,8 +80,8 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 	journal = EXT4_SB(sb)->s_journal;
 	if (!journal)
 		return ext4_get_nojournal();
-	return jbd2__journal_start(journal, blocks, rsv_blocks, 1024, GFP_NOFS,
-				   type, line);
+	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
+				   GFP_NOFS, type, line);
 }
 
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle)
@@ -134,14 +136,16 @@ handle_t *__ext4_journal_start_reserved(handle_t *handle, unsigned int line,
 }
 
 int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
-				  int extend_cred)
+				  int extend_cred, int revoke_cred)
 {
 	if (!ext4_handle_valid(handle))
 		return 0;
-	if (jbd2_handle_buffer_credits(handle) >= check_cred)
+	if (jbd2_handle_buffer_credits(handle) >= check_cred &&
+	    handle->h_revoke_credits >= revoke_cred)
 		return 0;
-	return ext4_journal_extend(handle,
-			   extend_cred - jbd2_handle_buffer_credits(handle));
+	extend_cred = max(0, extend_cred - jbd2_handle_buffer_credits(handle));
+	revoke_cred = max(0, revoke_cred - handle->h_revoke_credits);
+	return ext4_journal_extend(handle, extend_cred, revoke_cred);
 }
 
 static void ext4_journal_abort_handle(const char *caller, unsigned int line,
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index aca05e52e317..a6b9b66dbfad 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -261,7 +261,8 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
 	__ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
 
 handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
-				  int type, int blocks, int rsv_blocks);
+				  int type, int blocks, int rsv_blocks,
+				  int revoke_creds);
 int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
 
 #define EXT4_NOJOURNAL_MAX_REF_COUNT ((unsigned long) 4096)
@@ -288,21 +289,41 @@ static inline int ext4_handle_is_aborted(handle_t *handle)
 	return 0;
 }
 
+static inline int ext4_free_metadata_revoke_credits(struct super_block *sb,
+						    int blocks)
+{
+	/* Freeing each metadata block can result in freeing one cluster */
+	return blocks * EXT4_SB(sb)->s_cluster_ratio;
+}
+
+static inline int ext4_trans_default_revoke_credits(struct super_block *sb)
+{
+	return ext4_free_metadata_revoke_credits(sb, 8);
+}
+
 #define ext4_journal_start_sb(sb, type, nblocks)			\
-	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0)
+	__ext4_journal_start_sb((sb), __LINE__, (type), (nblocks), 0,	\
+				ext4_trans_default_revoke_credits(sb))
 
 #define ext4_journal_start(inode, type, nblocks)			\
-	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0)
+	__ext4_journal_start((inode), __LINE__, (type), (nblocks), 0,	\
+			     ext4_trans_default_revoke_credits((inode)->i_sb))
 
-#define ext4_journal_start_with_reserve(inode, type, blocks, rsv_blocks) \
-	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks))
+#define ext4_journal_start_with_reserve(inode, type, blocks, rsv_blocks)\
+	__ext4_journal_start((inode), __LINE__, (type), (blocks), (rsv_blocks),\
+			     ext4_trans_default_revoke_credits((inode)->i_sb))
+
+#define ext4_journal_start_with_revoke(inode, type, blocks, revoke_creds) \
+	__ext4_journal_start((inode), __LINE__, (type), (blocks), 0,	\
+			     (revoke_creds))
 
 static inline handle_t *__ext4_journal_start(struct inode *inode,
 					     unsigned int line, int type,
-					     int blocks, int rsv_blocks)
+					     int blocks, int rsv_blocks,
+					     int revoke_creds)
 {
 	return __ext4_journal_start_sb(inode->i_sb, line, type, blocks,
-				       rsv_blocks);
+				       rsv_blocks, revoke_creds);
 }
 
 #define ext4_journal_stop(handle) \
@@ -325,22 +346,23 @@ static inline handle_t *ext4_journal_current_handle(void)
 	return journal_current_handle();
 }
 
-static inline int ext4_journal_extend(handle_t *handle, int nblocks)
+static inline int ext4_journal_extend(handle_t *handle, int nblocks, int revoke)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_extend(handle, nblocks, 1024);
+		return jbd2_journal_extend(handle, nblocks, revoke);
 	return 0;
 }
 
-static inline int ext4_journal_restart(handle_t *handle, int nblocks)
+static inline int ext4_journal_restart(handle_t *handle, int nblocks,
+				       int revoke)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2__journal_restart(handle, nblocks, 1024, GFP_NOFS);
+		return jbd2__journal_restart(handle, nblocks, revoke, GFP_NOFS);
 	return 0;
 }
 
 int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
-				  int extend_cred);
+				  int extend_cred, int revoke_cred);
 
 
 /*
@@ -353,18 +375,19 @@ int __ext4_journal_ensure_credits(handle_t *handle, int check_cred,
  * credits or transaction extension succeeded, 1 in case transaction had to be
  * restarted.
  */
-#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred, fn) \
+#define ext4_journal_ensure_credits_fn(handle, check_cred, extend_cred,	\
+				       revoke_cred, fn) \
 ({									\
 	__label__ __ensure_end;						\
 	int err = __ext4_journal_ensure_credits((handle), (check_cred),	\
-						(extend_cred));		\
+					(extend_cred), (revoke_cred));	\
 									\
 	if (err <= 0)							\
 		goto __ensure_end;					\
 	err = (fn);							\
 	if (err < 0)							\
 		goto __ensure_end;					\
-	err = ext4_journal_restart((handle), (extend_cred));		\
+	err = ext4_journal_restart((handle), (extend_cred), (revoke_cred)); \
 	if (err == 0)							\
 		err = 1;						\
 __ensure_end:								\
@@ -373,18 +396,16 @@ __ensure_end:								\
 
 /*
  * Ensure given handle has at least requested amount of credits available,
- * possibly restarting transaction if needed.
+ * possibly restarting transaction if needed. We also make sure the transaction
+ * has space for at least ext4_trans_default_revoke_credits(sb) revoke records
+ * as freeing one or two blocks is very common pattern and requesting this is
+ * very cheap.
  */
-static inline int ext4_journal_ensure_credits(handle_t *handle, int credits)
+static inline int ext4_journal_ensure_credits(handle_t *handle, int credits,
+					      int revoke_creds)
 {
-	return ext4_journal_ensure_credits_fn(handle, credits, credits, 0);
-}
-
-static inline int ext4_journal_ensure_credits_batch(handle_t *handle,
-						    int credits)
-{
-	return ext4_journal_ensure_credits_fn(handle, credits,
-					      EXT4_MAX_TRANS_DATA, 0);
+	return ext4_journal_ensure_credits_fn(handle, credits, credits,
+				revoke_creds, 0);
 }
 
 static inline int ext4_journal_blocks_per_page(struct inode *inode)
@@ -479,6 +500,19 @@ static inline int ext4_should_writeback_data(struct inode *inode)
 	return ext4_inode_journal_mode(inode) & EXT4_INODE_WRITEBACK_DATA_MODE;
 }
 
+static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
+{
+	if (test_opt(inode->i_sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
+		return 0;
+	if (!ext4_should_journal_data(inode))
+		return 0;
+	/*
+	 * Data blocks in one extent are contiguous, just account for partial
+	 * clusters at extent boundaries
+	 */
+	return blocks + 2*(EXT4_SB(inode->i_sb)->s_cluster_ratio - 1);
+}
+
 /*
  * This function controls whether or not we should try to go down the
  * dioread_nolock code paths, which makes it safe to avoid taking
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 32f2c22c7ef2..ed28b21b826d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -124,13 +124,14 @@ static int ext4_ext_trunc_restart_fn(struct inode *inode, int *dropped)
  * and < 0 in case of fatal error.
  */
 int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
-				int check_cred, int restart_cred)
+				int check_cred, int restart_cred,
+				int revoke_cred)
 {
 	int ret;
 	int dropped = 0;
 
 	ret = ext4_journal_ensure_credits_fn(handle, check_cred, restart_cred,
-			ext4_ext_trunc_restart_fn(inode, &dropped));
+		revoke_cred, ext4_ext_trunc_restart_fn(inode, &dropped));
 	if (dropped)
 		down_write(&EXT4_I(inode)->i_data_sem);
 	return ret;
@@ -1851,7 +1852,8 @@ static void ext4_ext_try_to_merge_up(handle_t *handle,
 	 * group descriptor to release the extent tree block.  If we
 	 * can't get the journal credits, give up.
 	 */
-	if (ext4_journal_extend(handle, 2))
+	if (ext4_journal_extend(handle, 2,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1)))
 		return;
 
 	/*
@@ -2738,7 +2740,7 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int err = 0, correct_index = 0;
-	int depth = ext_depth(inode), credits;
+	int depth = ext_depth(inode), credits, revoke_credits;
 	struct ext4_extent_header *eh;
 	ext4_lblk_t a, b;
 	unsigned num;
@@ -2830,9 +2832,18 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
 			credits += (ext_depth(inode)) + 1;
 		}
 		credits += EXT4_MAXQUOTAS_TRANS_BLOCKS(inode->i_sb);
+		/*
+		 * We may end up freeing some index blocks and data from the
+		 * punched range. Note that partial clusters are accounted for
+		 * by ext4_free_data_revoke_credits().
+		 */
+		revoke_credits =
+			ext4_free_metadata_revoke_credits(inode->i_sb,
+							  ext_depth(inode)) +
+			ext4_free_data_revoke_credits(inode, b - a + 1);
 
 		err = ext4_datasem_ensure_credits(handle, inode, credits,
-						  credits);
+						  credits, revoke_credits);
 		if (err) {
 			if (err > 0)
 				err = -EAGAIN;
@@ -2963,7 +2974,9 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug("truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
+	handle = ext4_journal_start_with_revoke(inode, EXT4_HT_TRUNCATE,
+			depth + 1,
+			ext4_free_metadata_revoke_credits(inode->i_sb, depth));
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
@@ -5222,7 +5235,7 @@ ext4_access_path(handle_t *handle, struct inode *inode,
 	 * groups
 	 */
 	credits = ext4_writepage_trans_blocks(inode);
-	err = ext4_datasem_ensure_credits(handle, inode, 7, credits);
+	err = ext4_datasem_ensure_credits(handle, inode, 7, credits, 0);
 	if (err < 0)
 		return err;
 
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 764ff4c56233..fa8c3c485e4b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -927,7 +927,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 							 handle_type, nblocks,
-							 0);
+							 0, 0);
 			if (IS_ERR(handle)) {
 				err = PTR_ERR(handle);
 				ext4_std_error(sb, err);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 63e1d5846442..3a4ab70fe9e0 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -736,13 +736,14 @@ static int ext4_ind_trunc_restart_fn(handle_t *handle, struct inode *inode,
  */
 static int ext4_ind_truncate_ensure_credits(handle_t *handle,
 					    struct inode *inode,
-					    struct buffer_head *bh)
+					    struct buffer_head *bh,
+					    int revoke_creds)
 {
 	int ret;
 	int dropped = 0;
 
 	ret = ext4_journal_ensure_credits_fn(handle, EXT4_RESERVE_TRANS_BLOCKS,
-			ext4_blocks_for_truncate(inode),
+			ext4_blocks_for_truncate(inode), revoke_creds,
 			ext4_ind_trunc_restart_fn(handle, inode, bh, &dropped));
 	if (dropped)
 		down_write(&EXT4_I(inode)->i_data_sem);
@@ -889,7 +890,8 @@ static int ext4_clear_blocks(handle_t *handle, struct inode *inode,
 		return 1;
 	}
 
-	err = ext4_ind_truncate_ensure_credits(handle, inode, bh);
+	err = ext4_ind_truncate_ensure_credits(handle, inode, bh,
+				ext4_free_data_revoke_credits(inode, count));
 	if (err < 0)
 		goto out_err;
 
@@ -1075,7 +1077,9 @@ static void ext4_free_branches(handle_t *handle, struct inode *inode,
 			if (ext4_handle_is_aborted(handle))
 				return;
 			if (ext4_ind_truncate_ensure_credits(handle, inode,
-							     NULL) < 0)
+					NULL,
+					ext4_free_metadata_revoke_credits(
+							inode->i_sb, 1)) < 0)
 				return;
 
 			/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e346b5171f5a..f927cb5b002b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5949,7 +5949,7 @@ static int ext4_try_to_expand_extra_isize(struct inode *inode,
 	 * force a large enough s_min_extra_isize.
 	 */
 	if (ext4_journal_extend(handle,
-				EXT4_DATA_TRANS_BLOCKS(inode->i_sb)) != 0)
+				EXT4_DATA_TRANS_BLOCKS(inode->i_sb), 0) != 0)
 		return -ENOSPC;
 
 	if (ext4_write_trylock_xattr(inode, &no_expand) == 0)
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 65f09dc9d941..89725fa42573 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -50,7 +50,7 @@ static int finish_range(handle_t *handle, struct inode *inode,
 	needed = ext4_ext_calc_credits_for_single_extent(inode,
 		    lb->last_block - lb->first_block + 1, path);
 
-	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed);
+	retval = ext4_datasem_ensure_credits(handle, inode, needed, needed, 0);
 	if (retval < 0)
 		goto err_out;
 	retval = ext4_ext_insert_extent(handle, inode, &path, &newext, 0);
@@ -182,10 +182,11 @@ static int free_dind_blocks(handle_t *handle,
 	int i;
 	__le32 *tmp_idata;
 	struct buffer_head *bh;
+	struct super_block *sb = inode->i_sb;
 	unsigned long max_entries = inode->i_sb->s_blocksize >> 2;
 	int err;
 
-	bh = ext4_sb_bread(inode->i_sb, le32_to_cpu(i_data), 0);
+	bh = ext4_sb_bread(sb, le32_to_cpu(i_data), 0);
 	if (IS_ERR(bh))
 		return PTR_ERR(bh);
 
@@ -193,7 +194,8 @@ static int free_dind_blocks(handle_t *handle,
 	for (i = 0; i < max_entries; i++) {
 		if (tmp_idata[i]) {
 			err = ext4_journal_ensure_credits(handle,
-						EXT4_RESERVE_TRANS_BLOCKS);
+				EXT4_RESERVE_TRANS_BLOCKS,
+				ext4_free_metadata_revoke_credits(sb, 1));
 			if (err < 0) {
 				put_bh(bh);
 				return err;
@@ -205,7 +207,8 @@ static int free_dind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	err = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+				ext4_free_metadata_revoke_credits(sb, 1));
 	if (err < 0)
 		return err;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
@@ -238,7 +241,8 @@ static int free_tind_blocks(handle_t *handle,
 		}
 	}
 	put_bh(bh);
-	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (retval < 0)
 		return retval;
 	ext4_free_blocks(handle, inode, NULL, le32_to_cpu(i_data), 1,
@@ -254,7 +258,8 @@ static int free_ind_block(handle_t *handle, struct inode *inode, __le32 *i_data)
 	/* ei->i_data[EXT4_IND_BLOCK] */
 	if (i_data[0]) {
 		retval = ext4_journal_ensure_credits(handle,
-						     EXT4_RESERVE_TRANS_BLOCKS);
+			EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 		if (retval < 0)
 			return retval;
 		ext4_free_blocks(handle, inode, NULL,
@@ -291,7 +296,7 @@ static int ext4_ext_swap_inode_data(handle_t *handle, struct inode *inode,
 	 * One credit accounted for writing the
 	 * i_data field of the original inode
 	 */
-	retval = ext4_journal_ensure_credits(handle, 1);
+	retval = ext4_journal_ensure_credits(handle, 1, 0);
 	if (retval < 0)
 		goto err_out;
 
@@ -368,7 +373,8 @@ static int free_ext_idx(handle_t *handle, struct inode *inode,
 		}
 	}
 	put_bh(bh);
-	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS);
+	retval = ext4_journal_ensure_credits(handle, EXT4_RESERVE_TRANS_BLOCKS,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (retval < 0)
 		return retval;
 	ext4_free_blocks(handle, inode, NULL, block, 1,
@@ -548,7 +554,7 @@ int ext4_ext_migrate(struct inode *inode)
 	}
 
 	/* We mark the tmp_inode dirty via ext4_ext_tree_init. */
-	retval = ext4_journal_ensure_credits(handle, 1);
+	retval = ext4_journal_ensure_credits(handle, 1, 0);
 	if (retval < 0)
 		goto out_stop;
 	/*
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 3e4286b3901f..a8c0f2b5b6e1 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -388,6 +388,12 @@ static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
 	return bh;
 }
 
+static int ext4_resize_ensure_credits_batch(handle_t *handle, int credits)
+{
+	return ext4_journal_ensure_credits_fn(handle, credits,
+		EXT4_MAX_TRANS_DATA, 0, 0);
+}
+
 /*
  * set_flexbg_block_bitmap() mark clusters [@first_cluster, @last_cluster] used.
  *
@@ -427,7 +433,7 @@ static int set_flexbg_block_bitmap(struct super_block *sb, handle_t *handle,
 			continue;
 		}
 
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			return err;
 
@@ -520,7 +526,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 			struct buffer_head *gdb;
 
 			ext4_debug("update backup group %#04llx\n", block);
-			err = ext4_journal_ensure_credits_batch(handle, 1);
+			err = ext4_resize_ensure_credits_batch(handle, 1);
 			if (err < 0)
 				goto out;
 
@@ -578,7 +584,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize block bitmap of the @group */
 		block = group_data[i].block_bitmap;
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			goto out;
 
@@ -607,7 +613,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
 
 		/* Initialize inode bitmap of the @group */
 		block = group_data[i].inode_bitmap;
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			goto out;
 		/* Mark unused entries in inode bitmap used */
@@ -1085,7 +1091,7 @@ static void update_backups(struct super_block *sb, sector_t blk_off, char *data,
 		ext4_fsblk_t backup_block;
 
 		/* Out of journal space, and can't get more - abort - so sad */
-		err = ext4_journal_ensure_credits_batch(handle, 1);
+		err = ext4_resize_ensure_credits_batch(handle, 1);
 		if (err < 0)
 			break;
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 48a9dbd27f43..8966a5439a22 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1155,6 +1155,7 @@ ext4_xattr_inode_dec_ref_all(handle_t *handle, struct inode *parent,
 		}
 
 		err = ext4_journal_ensure_credits_fn(handle, credits, credits,
+			ext4_free_metadata_revoke_credits(parent->i_sb, 1),
 			ext4_xattr_restart_fn(handle, parent, bh, block_csum,
 					      dirty));
 		if (err < 0) {
@@ -2841,7 +2842,8 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 	struct inode *ea_inode;
 	int error;
 
-	error = ext4_journal_ensure_credits(handle, extra_credits);
+	error = ext4_journal_ensure_credits(handle, extra_credits,
+			ext4_free_metadata_revoke_credits(inode->i_sb, 1));
 	if (error < 0) {
 		EXT4_ERROR_INODE(inode, "ensure credits (error %d)", error);
 		goto cleanup;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..182c9fe9c0e9 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1746,15 +1746,16 @@ TRACE_EVENT(ext4_load_inode,
 
 TRACE_EVENT(ext4_journal_start,
 	TP_PROTO(struct super_block *sb, int blocks, int rsv_blocks,
-		 unsigned long IP),
+		 int revoke_creds, unsigned long IP),
 
-	TP_ARGS(sb, blocks, rsv_blocks, IP),
+	TP_ARGS(sb, blocks, rsv_blocks, revoke_creds, IP),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev			)
 		__field(unsigned long,	ip			)
 		__field(	  int,	blocks			)
 		__field(	  int,	rsv_blocks		)
+		__field(	  int,	revoke_creds		)
 	),
 
 	TP_fast_assign(
@@ -1762,11 +1763,13 @@ TRACE_EVENT(ext4_journal_start,
 		__entry->ip		 = IP;
 		__entry->blocks		 = blocks;
 		__entry->rsv_blocks	 = rsv_blocks;
+		__entry->revoke_creds	 = revoke_creds;
 	),
 
-	TP_printk("dev %d,%d blocks, %d rsv_blocks, %d caller %pS",
-		  MAJOR(__entry->dev), MINOR(__entry->dev),
-		  __entry->blocks, __entry->rsv_blocks, (void *)__entry->ip)
+	TP_printk("dev %d,%d blocks %d, rsv_blocks %d, revoke_creds %d, "
+		  "caller %pS", MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->blocks, __entry->rsv_blocks, __entry->revoke_creds,
+		  (void *)__entry->ip)
 );
 
 TRACE_EVENT(ext4_journal_start_reserved,
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 101+ messages in thread

* [PATCH 24/25] jbd2: Provide trace event for handle restarts
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (47 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 23/25] ext4: Reserve revoke credits for freed blocks Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 16:44 ` [PATCH 25/25] jbd2: Fine tune estimate of necessary descriptor blocks Jan Kara
                   ` (2 subsequent siblings)
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Provide trace event for handle restarts to ease debugging.

Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c       |  8 +++++++-
 include/trace/events/jbd2.h | 16 +++++++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 8f11b2d48ca0..a3374c1a3d41 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -747,6 +747,7 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 	journal_t *journal;
 	tid_t		tid;
 	int		need_to_start;
+	int		ret;
 
 	/* If we've had an abort of any type, don't even think about
 	 * actually doing the restart! */
@@ -776,7 +777,12 @@ int jbd2__journal_restart(handle_t *handle, int nblocks, int revoke_records,
 		DIV_ROUND_UP(revoke_records,
 			     journal->j_revoke_records_per_block);
 	handle->h_revoke_credits = revoke_records;
-	return start_this_handle(journal, handle, gfp_mask);
+	ret = start_this_handle(journal, handle, gfp_mask);
+	trace_jbd2_handle_restart(journal->j_fs_dev->bd_dev,
+				 ret ? 0 : handle->h_transaction->t_tid,
+				 handle->h_type, handle->h_line_no,
+				 handle->h_total_credits);
+	return ret;
 }
 EXPORT_SYMBOL(jbd2__journal_restart);
 
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 2310b259329f..d16a32867f3a 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -133,7 +133,7 @@ TRACE_EVENT(jbd2_submit_inode_data,
 		  (unsigned long) __entry->ino)
 );
 
-TRACE_EVENT(jbd2_handle_start,
+DECLARE_EVENT_CLASS(jbd2_handle_start_class,
 	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
 		 unsigned int line_no, int requested_blocks),
 
@@ -161,6 +161,20 @@ TRACE_EVENT(jbd2_handle_start,
 		  __entry->type, __entry->line_no, __entry->requested_blocks)
 );
 
+DEFINE_EVENT(jbd2_handle_start_class, jbd2_handle_start,
+	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
+		 unsigned int line_no, int requested_blocks),
+
+	TP_ARGS(dev, tid, type, line_no, requested_blocks)
+);
+
+DEFINE_EVENT(jbd2_handle_start_class, jbd2_handle_restart,
+	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
+		 unsigned int line_no, int requested_blocks),
+
+	TP_ARGS(dev, tid, type, line_no, requested_blocks)
+);
+
 TRACE_EVENT(jbd2_handle_extend,
 	TP_PROTO(dev_t dev, unsigned long tid, unsigned int type,
 		 unsigned int line_no, int buffer_credits,
-- 
2.16.4



* [PATCH 25/25] jbd2: Fine tune estimate of necessary descriptor blocks
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (48 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 24/25] jbd2: Provide trace event for handle restarts Jan Kara
@ 2019-11-05 16:44 ` Jan Kara
  2019-11-05 21:04 ` [PATCH 0/25 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
       [not found] ` <20191112220614.GA11089@mit.edu>
  51 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-05 16:44 UTC (permalink / raw)
  To: Ted Tso; +Cc: linux-ext4, Jan Kara

Currently we reserve j_max_transaction_buffers / 32 for transaction
descriptor blocks. Now that revoke descriptors are accounted for
separately, this estimate is unnecessarily high and we can actually
compute a much tighter estimate. In the common case of 32k journal blocks
and 4k blocksize this actually reduces the number of reserved descriptor
blocks from 256 to ~25, which allows us to fit more real data into a
transaction.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a3374c1a3d41..a9d3a2208506 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -63,14 +63,25 @@ void jbd2_journal_free_transaction(transaction_t *transaction)
 }
 
 /*
- * We reserve t_outstanding_credits >> JBD2_CONTROL_BLOCKS_SHIFT for
- * transaction descriptor blocks.
+ * Base amount of descriptor blocks we reserve for each transaction.
  */
-#define JBD2_CONTROL_BLOCKS_SHIFT 5
-
 static int jbd2_descriptor_blocks_per_trans(journal_t *journal)
 {
-	return journal->j_max_transaction_buffers >> JBD2_CONTROL_BLOCKS_SHIFT;
+	int tag_space = journal->j_blocksize - sizeof(journal_header_t);
+	int tags_per_block;
+
+	/* Subtract UUID */
+	tag_space -= 16;
+	if (jbd2_journal_has_csum_v2or3(journal))
+		tag_space -= sizeof(struct jbd2_journal_block_tail);
+	/* Commit code leaves a slack space of 16 bytes at the end of block */
+	tags_per_block = (tag_space - 16) / journal_tag_bytes(journal);
+	/*
+	 * Revoke descriptors are accounted separately so we need to reserve
+	 * space for commit block and normal transaction descriptor blocks.
+	 */
+	return 1 + DIV_ROUND_UP(journal->j_max_transaction_buffers,
+				tags_per_block);
 }
 
 /*
-- 
2.16.4
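
The arithmetic behind the "256 to ~25" figure in the patch above can be
checked in user space. The following is only a sketch of the calculation, not
kernel code: the byte sizes (12-byte journal header, 4-byte block tail) and the
tag size passed in by the caller are assumptions about the on-disk layout, and
the helper names are made up for illustration. The value 8192 for
j_max_transaction_buffers is implied by the old estimate (256 << 5).

```c
#include <assert.h>

/* Assumed on-disk sizes; illustrative constants, not taken from kernel headers. */
#define JOURNAL_HEADER_BYTES	12	/* assumed sizeof(journal_header_t) */
#define UUID_BYTES		16	/* UUID stored in the descriptor block */
#define BLOCK_TAIL_BYTES	4	/* assumed sizeof(struct jbd2_journal_block_tail) */
#define COMMIT_SLACK_BYTES	16	/* slack the commit code leaves at end of block */

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Old estimate: a fixed 1/32 of the maximum transaction size. */
static int old_descriptor_estimate(int max_transaction_buffers)
{
	return max_transaction_buffers >> 5;
}

/*
 * New estimate, following the structure of the patch: work out how many
 * block tags fit into one descriptor block, then reserve enough descriptor
 * blocks to tag every buffer, plus one block for the commit record.
 * tag_bytes stands in for journal_tag_bytes() and is supplied by the caller.
 */
static int new_descriptor_estimate(int blocksize, int max_transaction_buffers,
				   int tag_bytes, int has_csum_v2or3)
{
	int tag_space = blocksize - JOURNAL_HEADER_BYTES - UUID_BYTES;
	int tags_per_block;

	assert(tag_bytes > 0);
	if (has_csum_v2or3)
		tag_space -= BLOCK_TAIL_BYTES;
	tags_per_block = (tag_space - COMMIT_SLACK_BYTES) / tag_bytes;
	return 1 + DIV_ROUND_UP(max_transaction_buffers, tags_per_block);
}
```

Under these assumptions, old_descriptor_estimate(8192) gives 256, while
new_descriptor_estimate(4096, 8192, 12, 0) gives 26 for an assumed 12-byte tag
without checksums, in line with the "~25" quoted in the commit message. Larger
tags (for example 16 bytes with checksums enabled) push the result up somewhat,
but it stays far below 256.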



* Re: [PATCH 06/25] ext4: Fix credit estimate for final inode freeing
  2019-11-05 16:44 ` [PATCH 06/25] ext4: Fix credit estimate for final inode freeing Jan Kara
@ 2019-11-05 21:00   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-05 21:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, stable

On Tue, Nov 05, 2019 at 05:44:12PM +0100, Jan Kara wrote:
> @@ -252,8 +257,12 @@ void ext4_evict_inode(struct inode *inode)
>  	if (!IS_NOQUOTA(inode))
>  		extra_credits += EXT4_MAXQUOTAS_DEL_BLOCKS(inode->i_sb);
>  
> +	/*
> +	 * Block bitmap, group descriptor, and inode are accounted in both
> + 	 * ext4_blocks_for_truncate() and extra_credits. So subtract 3.
  ^^^

There was a minor whitespace nit which I fixed up in my tree here.

      	    	  	     	       - Ted


* Re: [PATCH 0/25 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
                   ` (49 preceding siblings ...)
  2019-11-05 16:44 ` [PATCH 25/25] jbd2: Fine tune estimate of necessary descriptor blocks Jan Kara
@ 2019-11-05 21:04 ` Theodore Y. Ts'o
       [not found] ` <20191112220614.GA11089@mit.edu>
  51 siblings, 0 replies; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-05 21:04 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Tue, Nov 05, 2019 at 05:44:06PM +0100, Jan Kara wrote:
> Hello,
> 
> Here is v3 of this series with a couple more bugs fixed. Now all failed tests Ted
> highlighted pass for me.

Thanks, I've applied this to the ext4 git tree.  Thanks for your work
on this patch series!

				- Ted


* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
       [not found]   ` <20191113094545.GC6367@quack2.suse.cz>
@ 2019-11-14  5:26     ` Theodore Y. Ts'o
  2019-11-14  8:49       ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-14  5:26 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4

On Wed, Nov 13, 2019 at 10:45:45AM +0100, Jan Kara wrote:
> Thanks for the heads up! I didn't do any performance testing with the jbd2
> changes specifically and our internal performance testing grid only checks
> Linus' kernel so it didn't yet even try to run that code. I'll queue some
> sqlite insert tests internally with my changes to see whether I'm able to
> reproduce. I don't have NVME disks available quickly but I guess SATA SSD
> could do the job as well...

Sorry, false alarm.  What Phoronix was testing was 5.3 versus 5.4-rcX,
using Ubuntu's bleeding-edge kernels.  It wouldn't have any of the
ext4 patches we have queued for the *next* merge window.

That being said, I wasn't able to reproduce the performance delta using
upstream kernels, running on a Google Compute Engine VM, machtype
n1-highcpu-8, using a GCE Local SSD (SCSI-attached) for the first
benchmark, which I believe was the pts/sqlite benchmark using a thread
count of 1:

     Phoronix Test Suite 9.0.1
     SQLite 3.30.1
     Threads / Copies: 1
     Seconds < Lower Is Better
     5.3.0 ..................... 225 |===========================================
     5.4.0-rc3 ................. 224 |==========================================
     5.4-rc3-80-gafb2442fa429 .. 227 |===========================================
     5.4.0-rc7 ................. 223 |==========================================

     Processor: Intel Xeon (4 Cores / 8 Threads), Chipset: Intel 440FX
     82441FX PMC, Memory: 1 x 7373 MB RAM, Disk: 11GB PersistentDisk +
     403GB EphemeralDisk, Network: Red Hat Virtio device

     OS: Debian 10, Kernel: 5.4.0-rc3-xfstests (x86_64) 20191113, Compiler:
     GCC 8.3.0, File-System: ext4, System Layer: KVM

This was done using an extension to a gce-xfstests test appliance, to
which I hope to be adding an automation engine where it will kexec
into a series of kernels, run the benchmarks and then spit out the
report somewhere.  For now, the benchmarks are run manually.

(Adding commentary and click-baity titles is left as an exercise to
the reader.  :-)

						- Ted
						
P.S.  For all that I like to make snarky comments about Phoronix.com,
I have to admit Michael Larabel has done a pretty good job with his
performance test engine.  I probably would have chosen a different
implementation than PHP, and I'd have added an explicit way to specify
the file system to be tested other than mounting it on top of
/var/lib/phoronix-test-suite, and at least have the option of placing
the benchmarks' build trees and binaries in a different location than
the file system under test.

But that being said, he's collecting a decent set of benchmark tools,
and it is pretty cool that it has an automated way of collecting the
benchmark results, including the pretty graphs suitable for web
articles and conference slide decks ("and now, we turn to the rigged
benchmarks section of the presentation designed to show my new feature
in the best possible light...").


* Re: [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors
  2019-11-14  5:26     ` [PATCH 0/19 " Theodore Y. Ts'o
@ 2019-11-14  8:49       ` Jan Kara
  0 siblings, 0 replies; 101+ messages in thread
From: Jan Kara @ 2019-11-14  8:49 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Thu 14-11-19 00:26:52, Theodore Y. Ts'o wrote:
> On Wed, Nov 13, 2019 at 10:45:45AM +0100, Jan Kara wrote:
> > Thanks for the heads up! I didn't do any performance testing with the jbd2
> > changes specifically and our internal performance testing grid only checks
> > Linus' kernel so it didn't yet even try to run that code. I'll queue some
> > sqlite insert tests internally with my changes to see whether I'm able to
> > reproduce. I don't have NVME disks available quickly but I guess SATA SSD
> > could do the job as well...
> 
> Sorry, false alarm.  What Phoronix was testing was 5.3 versus 5.4-rcX,
> using Ubuntu's bleeding-edge kernels.  It wouldn't have any of the
> ext4 patches we have queued for the *next* merge window.

OK, thanks for looking! I've run some tests on my test setup anyway...

> That being said, I wasn't able to reproduce the performance delta using
> upstream kernels, running on a Google Compute Engine VM, machtype
> n1-highcpu-8, using a GCE Local SSD (SCSI-attached) for the first
> benchmark, which I believe was the pts/sqlite benchmark using a thread
> count of 1:
> 
>      Phoronix Test Suite 9.0.1
>      SQLite 3.30.1
>      Threads / Copies: 1
>      Seconds < Lower Is Better
>      5.3.0 ..................... 225 |===========================================
>      5.4.0-rc3 ................. 224 |==========================================
>      5.4-rc3-80-gafb2442fa429 .. 227 |===========================================
>      5.4.0-rc7 ................. 223 |==========================================
> 
>      Processor: Intel Xeon (4 Cores / 8 Threads), Chipset: Intel 440FX
>      82441FX PMC, Memory: 1 x 7373 MB RAM, Disk: 11GB PersistentDisk +
>      403GB EphemeralDisk, Network: Red Hat Virtio device
> 
>      OS: Debian 10, Kernel: 5.4.0-rc3-xfstests (x86_64) 20191113, Compiler:
>      GCC 8.3.0, File-System: ext4, System Layer: KVM
> 
> This was done using an extension to a gce-xfstests test appliance, to
> which I hope to be adding an automation engine where it will kexec
> into a series of kernels, run the benchmarks and then spit out the
> report somewhere.  For now, the benchmarks are run manually.
> 
> (Adding commentary and click-baity titles is left as an exercise to
> the reader.  :-)
> 
> 						- Ted
> 						
> P.S.  For all that I like to make snarky comments about Phoronix.com,
> I have to admit Michael Larabel has done a pretty good job with his
> performance test engine.  I probably would have chosen a different
> implementation than PHP, and I'd have added an explicit way to specify
> the file system to be tested other than mounting it on top of
> /var/lib/phoronix-test-suite, and at least have the option of placing
> the benchmarks' build trees and binaries in a different location than
> the file system under test.
> 
> But that being said, he's collecting a decent set of benchmark tools,
> and it is pretty cool that it has an automated way of collecting the
> benchmark results, including the pretty graphs suitable for web
> articles and conference slide decks ("and now, we turn to the rigged
> benchmarks section of the presentation designed to show my new feature
> in the best possible light...").

Let me make a small marketing pitch for mmtests [1] :) For me, running the test
is just:
  * Boot the right kernel on the machine
  * Run:
   ./run-mmtests.sh -c configs/config-db-sqlite-insert-medium-ext4 \
      --no-monitor Whatever_run_name_1

Now the config file already has the proper partition, fstype, mkfs opts etc.
configured, so it's a bit of cheating, but still :). And when I have data for
both kernels, I do:
  cd work/log
  ../../compare_kernels.sh

and get a table comparing the two benchmarking runs, with averages,
standard deviations, percentiles, and other more advanced statistics to
distinguish signal from noise. We also have support for gathering various
monitoring data while the test is running (turbostat, iostat, vmstat, ...)
and for graphing all the results (although the graphs are aimed more at
quick analysis of what's going on than at presenting results to the
public).

So for this campaign I've compared "5.3+some SUSE patches" to "5.4-rc7+your
'dev' branch". And the results look like:

sqlite
                              5.3-SUSE                5.4-rc7
                                                     ext4-dev
Min       Trans     2181.67 (   0.00%)     2412.72 (  10.59%)
Hmean     Trans     2399.39 (   0.00%)     2602.73 *   8.47%*
Stddev    Trans      172.15 (   0.00%)      141.61 (  17.74%)
CoeffVar  Trans        7.14 (   0.00%)        5.43 (  24.00%)
Max       Trans     2671.84 (   0.00%)     3027.81 (  13.32%)
...

These are Trans/Sec values so there's actually a small improvement on this
machine. But it's somewhat difficult to tell because the benchmark variation
is rather high (likely due to the powersave cpufreq governor, if I had to guess).

								Honza

[1] git://github.com/gormanm/mmtests
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
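
The percentages in the comparison table above can be reproduced from the raw
columns. The sketch below is not mmtests code; it assumes that "Hmean" denotes
the harmonic mean and that, for lower-is-better rows such as Stddev, the
reported percentage is the relative reduction against the baseline. All
function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean of n strictly positive samples (assumed meaning of "Hmean"). */
static double harmonic_mean(const double *samples, size_t n)
{
	double inverse_sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++) {
		assert(samples[i] > 0.0);
		inverse_sum += 1.0 / samples[i];
	}
	return (double)n / inverse_sum;
}

/* Relative gain in percent for higher-is-better metrics (e.g. Trans/Sec). */
static double gain_percent(double baseline, double candidate)
{
	return (candidate - baseline) / baseline * 100.0;
}

/* Relative reduction in percent for lower-is-better metrics (e.g. Stddev). */
static double reduction_percent(double baseline, double candidate)
{
	return (baseline - candidate) / baseline * 100.0;
}
```

With the table's values, gain_percent(2399.39, 2602.73) comes out at roughly
8.47 and reduction_percent(172.15, 141.61) at roughly 17.74, matching the
Hmean and Stddev rows.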


* Re: [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks
  2019-11-05 16:44 ` [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks Jan Kara
@ 2019-11-15  7:52   ` Eric Biggers
  2019-11-15 10:02     ` Jan Kara
  0 siblings, 1 reply; 101+ messages in thread
From: Eric Biggers @ 2019-11-15  7:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ted Tso, linux-ext4

On Tue, Nov 05, 2019 at 05:44:26PM +0100, Jan Kara wrote:
>  static inline int jbd2_handle_buffer_credits(handle_t *handle)
>  {
> -	return handle->h_buffer_credits;
> +	journal_t *journal = handle->h_transaction->t_journal;
> +
> +	return handle->h_buffer_credits -
> +		DIV_ROUND_UP(handle->h_revoke_credits_requested,
> +			     journal->j_revoke_records_per_block);
>  }

This patch is causing a crash with 'kvm-xfstests -c dioread_nolock ext4/024'.
Looks like this code incorrectly assumes that h_transaction is always valid
rather than the other member of the union, h_journal.


BUG: kernel NULL pointer dereference, address: 0000000000000614
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] SMP
CPU: 1 PID: 105 Comm: kworker/u4:3 Not tainted 5.4.0-rc3-00020-gfdc3ef882a5d #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
RIP: 0010:jbd2_handle_buffer_credits include/linux/jbd2.h:1656 [inline]
RIP: 0010:__ext4_journal_start_reserved+0x38/0x1f0 fs/ext4/ext4_jbd2.c:122
Code: 83 ec 10 48 81 ff ff 0f 00 00 89 75 d4 89 55 d0 0f 86 f5 00 00 00 48 8b 07 49 89 fc 48 8b 5d 08 4c 8b a8 40 07 00 6
RSP: 0018:ffffc90000457d40 EFLAGS: 00010296
RAX: 0000000000000038 RBX: ffffffff812e68fb RCX: 000000000000000c
RDX: 000000000000000b RSI: 000000000000137f RDI: ffff8880779c5468
RBP: ffffc90000457d78 R08: 0000000000001000 R09: 0000000000001000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880779c5468
R13: ffff88807b726000 R14: ffff8880779ad9e8 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000614 CR3: 000000007a0dd000 CR4: 00000000003406e0
Call Trace:
 ext4_convert_unwritten_extents+0x8b/0x250 fs/ext4/extents.c:4991
 ext4_end_io fs/ext4/page-io.c:152 [inline]
 ext4_do_flush_completed_IO fs/ext4/page-io.c:226 [inline]
 ext4_end_io_rsv_work+0x11a/0x1f0 fs/ext4/page-io.c:240
 process_one_work+0x227/0x5b0 kernel/workqueue.c:2269
 worker_thread+0x4b/0x3c0 kernel/workqueue.c:2415
 kthread+0x125/0x140 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
CR2: 0000000000000614
---[ end trace d8eaf4e1225480d5 ]---
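
The union problem described in this report can be modelled in user space. The
sketch below is not the fix that was eventually sent; it only illustrates one
way to choose the valid union member, on the assumption that a flag such as
h_reserved tells a reserved handle (h_journal valid) from a started one
(h_transaction valid). All model_* types and functions are hypothetical
stand-ins for the real jbd2 structures.

```c
#include <assert.h>
#include <stddef.h>

#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* Hypothetical, heavily reduced stand-ins for journal_t and transaction_t. */
struct model_journal {
	int j_revoke_records_per_block;
};

struct model_transaction {
	struct model_journal *t_journal;
};

struct model_handle {
	union {
		struct model_transaction *h_transaction; /* started handle */
		struct model_journal *h_journal;	  /* reserved handle */
	};
	int h_buffer_credits;
	int h_revoke_credits_requested;
	unsigned int h_reserved : 1;	/* assumed to say which member is valid */
};

/* Pick the journal through whichever union member is currently valid. */
static struct model_journal *model_handle_journal(const struct model_handle *handle)
{
	if (handle->h_reserved)
		return handle->h_journal;
	return handle->h_transaction->t_journal;
}

/* Buffer credits excluding the blocks set aside for revoke descriptors. */
static int model_handle_buffer_credits(const struct model_handle *handle)
{
	struct model_journal *journal = model_handle_journal(handle);

	assert(journal != NULL);
	return handle->h_buffer_credits -
		DIV_ROUND_UP(handle->h_revoke_credits_requested,
			     journal->j_revoke_records_per_block);
}
```

In this model, a reserved handle and a started handle carrying the same
credits yield the same result, whereas dereferencing h_transaction
unconditionally would reinterpret the journal pointer as a transaction, which
is the kind of misread the oops above points to.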


* Re: [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks
  2019-11-15  7:52   ` Eric Biggers
@ 2019-11-15 10:02     ` Jan Kara
  2019-11-15 14:20       ` Theodore Y. Ts'o
  0 siblings, 1 reply; 101+ messages in thread
From: Jan Kara @ 2019-11-15 10:02 UTC (permalink / raw)
  To: Eric Biggers; +Cc: Jan Kara, Ted Tso, linux-ext4

On Thu 14-11-19 23:52:23, Eric Biggers wrote:
> On Tue, Nov 05, 2019 at 05:44:26PM +0100, Jan Kara wrote:
> >  static inline int jbd2_handle_buffer_credits(handle_t *handle)
> >  {
> > -	return handle->h_buffer_credits;
> > +	journal_t *journal = handle->h_transaction->t_journal;
> > +
> > +	return handle->h_buffer_credits -
> > +		DIV_ROUND_UP(handle->h_revoke_credits_requested,
> > +			     journal->j_revoke_records_per_block);
> >  }
> 
> This patch is causing a crash with 'kvm-xfstests -c dioread_nolock ext4/024'.
> Looks like this code incorrectly assumes that h_transaction is always valid
> rather than the other member of the union, h_journal.

Right, thanks for the report! Just out of curiosity: You have to have that
tracepoint enabled for the crash to trigger, don't you? Because I'm pretty
sure I did dioread_nolock runs...

I'll send a fix shortly.

								Honza

> 
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000614
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0 
> Oops: 0000 [#1] SMP
> CPU: 1 PID: 105 Comm: kworker/u4:3 Not tainted 5.4.0-rc3-00020-gfdc3ef882a5d #18
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
> Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work
> RIP: 0010:jbd2_handle_buffer_credits include/linux/jbd2.h:1656 [inline]
> RIP: 0010:__ext4_journal_start_reserved+0x38/0x1f0 fs/ext4/ext4_jbd2.c:122
> Code: 83 ec 10 48 81 ff ff 0f 00 00 89 75 d4 89 55 d0 0f 86 f5 00 00 00 48 8b 07 49 89 fc 48 8b 5d 08 4c 8b a8 40 07 00 6
> RSP: 0018:ffffc90000457d40 EFLAGS: 00010296
> RAX: 0000000000000038 RBX: ffffffff812e68fb RCX: 000000000000000c
> RDX: 000000000000000b RSI: 000000000000137f RDI: ffff8880779c5468
> RBP: ffffc90000457d78 R08: 0000000000001000 R09: 0000000000001000
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff8880779c5468
> R13: ffff88807b726000 R14: ffff8880779ad9e8 R15: 0000000000000000
> FS:  0000000000000000(0000) GS:ffff88807fd00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000614 CR3: 000000007a0dd000 CR4: 00000000003406e0
> Call Trace:
>  ext4_convert_unwritten_extents+0x8b/0x250 fs/ext4/extents.c:4991
>  ext4_end_io fs/ext4/page-io.c:152 [inline]
>  ext4_do_flush_completed_IO fs/ext4/page-io.c:226 [inline]
>  ext4_end_io_rsv_work+0x11a/0x1f0 fs/ext4/page-io.c:240
>  process_one_work+0x227/0x5b0 kernel/workqueue.c:2269
>  worker_thread+0x4b/0x3c0 kernel/workqueue.c:2415
>  kthread+0x125/0x140 kernel/kthread.c:255
>  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
> CR2: 0000000000000614
> ---[ end trace d8eaf4e1225480d5 ]---
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks
  2019-11-15 10:02     ` Jan Kara
@ 2019-11-15 14:20       ` Theodore Y. Ts'o
  2019-11-15 17:10         ` Eric Biggers
  0 siblings, 1 reply; 101+ messages in thread
From: Theodore Y. Ts'o @ 2019-11-15 14:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: Eric Biggers, linux-ext4

On Fri, Nov 15, 2019 at 11:02:22AM +0100, Jan Kara wrote:
> On Thu 14-11-19 23:52:23, Eric Biggers wrote:
> > On Tue, Nov 05, 2019 at 05:44:26PM +0100, Jan Kara wrote:
> > >  static inline int jbd2_handle_buffer_credits(handle_t *handle)
> > >  {
> > > -	return handle->h_buffer_credits;
> > > +	journal_t *journal = handle->h_transaction->t_journal;
> > > +
> > > +	return handle->h_buffer_credits -
> > > +		DIV_ROUND_UP(handle->h_revoke_credits_requested,
> > > +			     journal->j_revoke_records_per_block);
> > >  }
> > 
> > This patch is causing a crash with 'kvm-xfstests -c dioread_nolock ext4/024'.
> > Looks like this code incorrectly assumes that h_transaction is always valid
> > rather than the other member of the union, h_journal.
> 
> Right, thanks for the report! Just out of curiosity: You have to have that
> tracepoint enabled for the crash to trigger, don't you? Because I'm pretty
> sure I did dioread_nolock runs...

I've *definitely* been doing dioread_nolock runs (including two
last night), with no failures.

ext4/dioread_nolock: 485 tests, 40 skipped, 5142 seconds

		     	 	   	    - Ted


* Re: [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks
  2019-11-15 14:20       ` Theodore Y. Ts'o
@ 2019-11-15 17:10         ` Eric Biggers
  0 siblings, 0 replies; 101+ messages in thread
From: Eric Biggers @ 2019-11-15 17:10 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Jan Kara, linux-ext4

On Fri, Nov 15, 2019 at 09:20:33AM -0500, Theodore Y. Ts'o wrote:
> On Fri, Nov 15, 2019 at 11:02:22AM +0100, Jan Kara wrote:
> > On Thu 14-11-19 23:52:23, Eric Biggers wrote:
> > > On Tue, Nov 05, 2019 at 05:44:26PM +0100, Jan Kara wrote:
> > > >  static inline int jbd2_handle_buffer_credits(handle_t *handle)
> > > >  {
> > > > -	return handle->h_buffer_credits;
> > > > +	journal_t *journal = handle->h_transaction->t_journal;
> > > > +
> > > > +	return handle->h_buffer_credits -
> > > > +		DIV_ROUND_UP(handle->h_revoke_credits_requested,
> > > > +			     journal->j_revoke_records_per_block);
> > > >  }
> > > 
> > > This patch is causing a crash with 'kvm-xfstests -c dioread_nolock ext4/024'.
> > > Looks like this code incorrectly assumes that h_transaction is always valid
> > > rather than the other member of the union, h_journal.
> > 
> > Right, thanks for the report! Just out of curiosity: You have to have that
> > tracepoint enabled for the crash to trigger, don't you? Because I'm pretty
> > sure I did dioread_nolock runs...
> 
> I've *definitely* been doing dioread_nolock runs (including two
> last night), with no failures.
> 
> ext4/dioread_nolock: 485 tests, 40 skipped, 5142 seconds
> 

No, I didn't enable the tracepoint.  I think the difference is that I had
CONFIG_UBSAN enabled.  I get the crash if I use the following kconfig:

	curl -o .config 'https://git.kernel.org/pub/scm/fs/ext2/xfstests-bld.git/plain/kernel-configs/x86_64-config-5.4'
	echo CONFIG_UBSAN=y >> .config
	make olddefconfig

... but not if I don't enable UBSAN.

No idea why UBSAN makes a difference here, though.  I'm using gcc 9.2.0.

- Eric


end of thread, other threads:[~2019-11-15 17:10 UTC | newest]

Thread overview: 101+ messages (download: mbox.gz / follow: Atom feed)
2019-10-03 22:05 [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Jan Kara
2019-10-03 22:05 ` [PATCH 01/22] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
2019-10-21  1:08   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 02/22] jbd2: Fixup stale comment in commit code Jan Kara
2019-10-21  1:08   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 03/22] ext4: Do not iput inode under running transaction in ext4_mkdir() Jan Kara
2019-10-21  1:21   ` Theodore Y. Ts'o
2019-10-24 10:19     ` Jan Kara
2019-10-24 12:09       ` Theodore Y. Ts'o
2019-10-24 13:37         ` Jan Kara
2019-11-04 12:35           ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 04/22] ext4: Fix credit estimate for final inode freeing Jan Kara
2019-10-21  1:07   ` Theodore Y. Ts'o
2019-10-24 10:30     ` Jan Kara
2019-10-03 22:05 ` [PATCH 05/22] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
2019-10-21  1:38   ` Theodore Y. Ts'o
2019-10-23 16:55     ` Jan Kara
2019-10-03 22:05 ` [PATCH 06/22] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
2019-10-21  1:39   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 07/22] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
2019-10-21 13:39   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 08/22] ext4: Provide function to handle transaction restarts Jan Kara
2019-10-21 16:20   ` Theodore Y. Ts'o
2019-10-23 16:25     ` Jan Kara
2019-10-03 22:05 ` [PATCH 09/22] ext4, jbd2: Provide accessor function for handle credits Jan Kara
2019-10-21 16:21   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 10/22] ocfs2: Use accessor function for h_buffer_credits Jan Kara
2019-10-21 16:21   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 11/22] jbd2: Fix statistics for the number of logged blocks Jan Kara
2019-10-21 16:24   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 12/22] jbd2: Reorganize jbd2_journal_stop() Jan Kara
2019-10-21 17:29   ` Theodore Y. Ts'o
2019-10-03 22:05 ` [PATCH 13/22] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
2019-10-21 17:30   ` Theodore Y. Ts'o
2019-10-03 22:06 ` [PATCH 14/22] jbd2: Drop pointless wakeup " Jan Kara
2019-10-21 17:34   ` Theodore Y. Ts'o
2019-10-03 22:06 ` [PATCH 15/22] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
2019-10-21 17:49   ` Theodore Y. Ts'o
2019-10-23 16:17     ` Jan Kara
2019-11-04 12:36       ` Theodore Y. Ts'o
2019-11-04 12:59         ` Jan Kara
2019-10-03 22:06 ` [PATCH 16/22] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
2019-10-21 21:04   ` Theodore Y. Ts'o
2019-10-23 13:09     ` Jan Kara
2019-10-03 22:06 ` [PATCH 17/22] jbd2: Drop jbd2_space_needed() Jan Kara
2019-10-21 21:05   ` Theodore Y. Ts'o
2019-10-03 22:06 ` [PATCH 18/22] jbd2: Reserve space for revoke descriptor blocks Jan Kara
2019-10-21 21:47   ` Theodore Y. Ts'o
2019-10-23 13:27     ` Jan Kara
2019-10-03 22:06 ` [PATCH 19/22] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
2019-10-21 21:48   ` Theodore Y. Ts'o
2019-10-03 22:06 ` [PATCH 20/22] jbd2: Make credit checking more strict Jan Kara
2019-10-21 22:29   ` Theodore Y. Ts'o
2019-10-23 13:30     ` Jan Kara
2019-10-03 22:06 ` [PATCH 21/22] ext4: Reserve revoke credits for freed blocks Jan Kara
2019-10-21 23:18   ` Theodore Y. Ts'o
2019-10-23 16:13     ` Jan Kara
2019-11-04 13:08       ` Theodore Y. Ts'o
2019-11-05  8:31         ` Jan Kara
2019-10-03 22:06 ` [PATCH 22/22] jbd2: Provide trace event for handle restarts Jan Kara
2019-10-21 23:18   ` Theodore Y. Ts'o
2019-10-19 19:19 ` [PATCH 0/19 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
2019-10-24 13:09   ` Jan Kara
2019-10-24 15:12     ` Jan Kara
2019-11-04  3:32 ` Theodore Y. Ts'o
2019-11-04 11:22   ` Jan Kara
2019-11-04 13:09     ` Theodore Y. Ts'o
2019-11-05 16:44 ` [PATCH 0/25 " Jan Kara
2019-11-05 16:44 ` [PATCH 01/25] jbd2: Fix possible overflow in jbd2_log_space_left() Jan Kara
2019-11-05 16:44 ` [PATCH 02/25] jbd2: Fixup stale comment in commit code Jan Kara
2019-11-05 16:44 ` [PATCH 03/25] jbd2: Completely fill journal descriptor blocks Jan Kara
2019-11-05 16:44 ` [PATCH 04/25] ext4: Move marking of handle as sync to ext4_add_nondir() Jan Kara
2019-11-05 16:44 ` [PATCH 05/25] ext4: Do not iput inode under running transaction Jan Kara
2019-11-05 16:44 ` [PATCH 06/25] ext4: Fix credit estimate for final inode freeing Jan Kara
2019-11-05 21:00   ` Theodore Y. Ts'o
2019-11-05 16:44 ` [PATCH 07/25] ext4: Fix ext4_should_journal_data() for EA inodes Jan Kara
2019-11-05 16:44 ` [PATCH 08/25] ext4: Use ext4_journal_extend() instead of jbd2_journal_extend() Jan Kara
2019-11-05 16:44 ` [PATCH 09/25] ext4: Avoid unnecessary revokes in ext4_alloc_branch() Jan Kara
2019-11-05 16:44 ` [PATCH 10/25] ext4: Provide function to handle transaction restarts Jan Kara
2019-11-05 16:44 ` [PATCH 11/25] ext4, jbd2: Provide accessor function for handle credits Jan Kara
2019-11-05 16:44 ` [PATCH 12/25] ocfs2: Use accessor function for h_buffer_credits Jan Kara
2019-11-05 16:44 ` [PATCH 13/25] jbd2: Fix statistics for the number of logged blocks Jan Kara
2019-11-05 16:44 ` [PATCH 14/25] jbd2: Reorganize jbd2_journal_stop() Jan Kara
2019-11-05 16:44 ` [PATCH 15/25] jbd2: Drop pointless check from jbd2_journal_stop() Jan Kara
2019-11-05 16:44 ` [PATCH 16/25] jbd2: Drop pointless wakeup " Jan Kara
2019-11-05 16:44 ` [PATCH 17/25] jbd2: Factor out common parts of stopping and restarting a handle Jan Kara
2019-11-05 16:44 ` [PATCH 18/25] jbd2: Account descriptor blocks into t_outstanding_credits Jan Kara
2019-11-05 16:44 ` [PATCH 19/25] jbd2: Drop jbd2_space_needed() Jan Kara
2019-11-05 16:44 ` [PATCH 20/25] jbd2: Reserve space for revoke descriptor blocks Jan Kara
2019-11-15  7:52   ` Eric Biggers
2019-11-15 10:02     ` Jan Kara
2019-11-15 14:20       ` Theodore Y. Ts'o
2019-11-15 17:10         ` Eric Biggers
2019-11-05 16:44 ` [PATCH 21/25] jbd2: Rename h_buffer_credits to h_total_credits Jan Kara
2019-11-05 16:44 ` [PATCH 22/25] jbd2: Make credit checking more strict Jan Kara
2019-11-05 16:44 ` [PATCH 23/25] ext4: Reserve revoke credits for freed blocks Jan Kara
2019-11-05 16:44 ` [PATCH 24/25] jbd2: Provide trace event for handle restarts Jan Kara
2019-11-05 16:44 ` [PATCH 25/25] jbd2: Fine tune estimate of necessary descriptor blocks Jan Kara
2019-11-05 21:04 ` [PATCH 0/25 v3] ext4: Fix transaction overflow due to revoke descriptors Theodore Y. Ts'o
     [not found] ` <20191112220614.GA11089@mit.edu>
     [not found]   ` <20191113094545.GC6367@quack2.suse.cz>
2019-11-14  5:26     ` [PATCH 0/19 " Theodore Y. Ts'o
2019-11-14  8:49       ` Jan Kara
