linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Add dirty range scoping to jbd2
@ 2019-06-19 17:21 Ross Zwisler
  2019-06-19 17:21 ` [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors() Ross Zwisler
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Ross Zwisler @ 2019-06-19 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, Fletcher Woodruff,
	Justin TerAvest

This patch series fixes the issue I described here:

https://www.spinics.net/lists/linux-block/msg38274.html

Essentially the issue is that journal_finish_inode_data_buffers() operates
on the entire address space of each of the inodes associated with a given
journal entry.  This means that if we have an inode where we are constantly
appending dirty pages we can end up waiting for an indefinite amount of
time in journal_finish_inode_data_buffers().

This series improves this situation in ext4 by scoping each of the inode
dirty ranges associated with a given transaction.  Other users of jbd2
which don't (yet?) take advantage of this scoping (ocfs2) will continue to
have the old behavior.

Ross Zwisler (3):
  mm: add filemap_fdatawait_range_keep_errors()
  jbd2: introduce jbd2_inode dirty range scoping
  ext4: use jbd2_inode dirty range scoping

 fs/ext4/ext4_jbd2.h   | 12 +++++------
 fs/ext4/inode.c       | 13 +++++++++---
 fs/ext4/move_extent.c |  3 ++-
 fs/jbd2/commit.c      | 26 +++++++++++++++++------
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ++++++++++++++++++++++++-------------------
 include/linux/fs.h    |  2 ++
 include/linux/jbd2.h  | 22 +++++++++++++++++++
 mm/filemap.c          | 22 +++++++++++++++++++
 9 files changed, 114 insertions(+), 37 deletions(-)

-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors()
  2019-06-19 17:21 [PATCH 0/3] Add dirty range scoping to jbd2 Ross Zwisler
@ 2019-06-19 17:21 ` Ross Zwisler
  2019-06-20  9:25   ` Jan Kara
  2019-06-19 17:21 ` [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping Ross Zwisler
  2019-06-19 17:21 ` [PATCH 3/3] ext4: use " Ross Zwisler
  2 siblings, 1 reply; 9+ messages in thread
From: Ross Zwisler @ 2019-06-19 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, Fletcher Woodruff,
	Justin TerAvest

In the spirit of filemap_fdatawait_range() and
filemap_fdatawait_keep_errors(), introduce
filemap_fdatawait_range_keep_errors() which both takes a range upon
which to wait and does not clear errors from the address space.

Signed-off-by: Ross Zwisler <zwisler@google.com>
---
 include/linux/fs.h |  2 ++
 mm/filemap.c       | 22 ++++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f7fdfe93e25d3..79fec8a8413f4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2712,6 +2712,8 @@ extern int filemap_flush(struct address_space *);
 extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
 extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
 				   loff_t lend);
+extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte);
 
 static inline int filemap_fdatawait(struct address_space *mapping)
 {
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa5..e87252ca0835a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -553,6 +553,28 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
 }
 EXPORT_SYMBOL(filemap_fdatawait_range);
 
+/**
+ * filemap_fdatawait_range_keep_errors - wait for writeback to complete
+ * @mapping:		address space structure to wait for
+ * @start_byte:		offset in bytes where the range starts
+ * @end_byte:		offset in bytes where the range ends (inclusive)
+ *
+ * Walk the list of under-writeback pages of the given address space in the
+ * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
+ * this function does not clear error status of the address space.
+ *
+ * Use this function if callers don't handle errors themselves.  Expected
+ * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
+ * fsfreeze(8)
+ */
+int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
+		loff_t start_byte, loff_t end_byte)
+{
+	__filemap_fdatawait_range(mapping, start_byte, end_byte);
+	return filemap_check_and_keep_errors(mapping);
+}
+EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
+
 /**
  * file_fdatawait_range - wait for writeback to complete
  * @file:		file pointing to address space structure to wait for
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
  2019-06-19 17:21 [PATCH 0/3] Add dirty range scoping to jbd2 Ross Zwisler
  2019-06-19 17:21 ` [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors() Ross Zwisler
@ 2019-06-19 17:21 ` Ross Zwisler
  2019-06-20 11:04   ` Jan Kara
  2019-06-19 17:21 ` [PATCH 3/3] ext4: use " Ross Zwisler
  2 siblings, 1 reply; 9+ messages in thread
From: Ross Zwisler @ 2019-06-19 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, Fletcher Woodruff,
	Justin TerAvest

Currently both journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() operate on the entire address space
of each of the inodes associated with a given journal entry.  The
consequence of this is that if we have an inode where we are constantly
appending dirty pages we can end up waiting for an indefinite amount of
time in journal_finish_inode_data_buffers() while we wait for all the
pages under writeback to be written out.

The easiest way to cause this type of workload is do just dd from
/dev/zero to a file until it fills the entire filesystem.  This can
cause journal_finish_inode_data_buffers() to wait for the duration of
the entire dd operation.

We can improve this situation by scoping each of the inode dirty ranges
associated with a given transaction.  We do this via the jbd2_inode
structure so that the scoping is contained within jbd2 and so that it
follows the lifetime and locking rules for that structure.

This allows us to limit the writeback & wait in
journal_submit_inode_data_buffers() and
journal_finish_inode_data_buffers() respectively to the dirty range for
a given struct jdb2_inode, keeping us from waiting forever if the inode
in question is still being appended to.

Signed-off-by: Ross Zwisler <zwisler@google.com>
---
 fs/jbd2/commit.c      | 26 +++++++++++++++++------
 fs/jbd2/journal.c     |  2 ++
 fs/jbd2/transaction.c | 49 ++++++++++++++++++++++++-------------------
 include/linux/jbd2.h  | 22 +++++++++++++++++++
 4 files changed, 72 insertions(+), 27 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index efd0ce9489ae9..b4b99ea6e8700 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,14 +187,15 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with dealyed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping)
+static int journal_submit_inode_data_buffers(struct address_space *mapping,
+		loff_t dirty_start, loff_t dirty_end)
 {
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
 		.nr_to_write = mapping->nrpages * 2,
-		.range_start = 0,
-		.range_end = i_size_read(mapping->host),
+		.range_start = dirty_start,
+		.range_end = dirty_end,
 	};
 
 	ret = generic_writepages(mapping, &wbc);
@@ -218,6 +219,9 @@ static int journal_submit_data_buffers(journal_t *journal,
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
@@ -230,7 +234,8 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping);
+		err = journal_submit_inode_data_buffers(mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
+		loff_t dirty_start = jinode->i_dirty_start;
+		loff_t dirty_end = jinode->i_dirty_end;
+
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_keep_errors(
-				jinode->i_vfs_inode->i_mapping);
+		err = filemap_fdatawait_range_keep_errors(
+				jinode->i_vfs_inode->i_mapping, dirty_start,
+				dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
+
+		if (!jinode->i_next_transaction) {
+			jinode->i_dirty_start = 0;
+			jinode->i_dirty_end = 0;
+		}
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
 		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 43df0c943229c..288b8e7cf21c7 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2574,6 +2574,8 @@ void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode)
 	jinode->i_next_transaction = NULL;
 	jinode->i_vfs_inode = inode;
 	jinode->i_flags = 0;
+	jinode->i_dirty_start = 0;
+	jinode->i_dirty_end = 0;
 	INIT_LIST_HEAD(&jinode->i_list);
 }
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 8ca4fddc705fe..990e7b5062e74 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2565,7 +2565,7 @@ void jbd2_journal_refile_buffer(journal_t *journal, struct journal_head *jh)
  * File inode in the inode list of the handle's transaction
  */
 static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
-				   unsigned long flags)
+		unsigned long flags, loff_t start_byte, loff_t end_byte)
 {
 	transaction_t *transaction = handle->h_transaction;
 	journal_t *journal;
@@ -2577,26 +2577,17 @@ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 	jbd_debug(4, "Adding inode %lu, tid:%d\n", jinode->i_vfs_inode->i_ino,
 			transaction->t_tid);
 
-	/*
-	 * First check whether inode isn't already on the transaction's
-	 * lists without taking the lock. Note that this check is safe
-	 * without the lock as we cannot race with somebody removing inode
-	 * from the transaction. The reason is that we remove inode from the
-	 * transaction only in journal_release_jbd_inode() and when we commit
-	 * the transaction. We are guarded from the first case by holding
-	 * a reference to the inode. We are safe against the second case
-	 * because if jinode->i_transaction == transaction, commit code
-	 * cannot touch the transaction because we hold reference to it,
-	 * and if jinode->i_next_transaction == transaction, commit code
-	 * will only file the inode where we want it.
-	 */
-	if ((jinode->i_transaction == transaction ||
-	    jinode->i_next_transaction == transaction) &&
-	    (jinode->i_flags & flags) == flags)
-		return 0;
-
 	spin_lock(&journal->j_list_lock);
 	jinode->i_flags |= flags;
+
+	if (jinode->i_dirty_end) {
+		jinode->i_dirty_start = min(jinode->i_dirty_start, start_byte);
+		jinode->i_dirty_end = max(jinode->i_dirty_end, end_byte);
+	} else {
+		jinode->i_dirty_start = start_byte;
+		jinode->i_dirty_end = end_byte;
+	}
+
 	/* Is inode already attached where we need it? */
 	if (jinode->i_transaction == transaction ||
 	    jinode->i_next_transaction == transaction)
@@ -2631,12 +2622,28 @@ static int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode,
 int jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *jinode)
 {
 	return jbd2_journal_file_inode(handle, jinode,
-				       JI_WRITE_DATA | JI_WAIT_DATA);
+			JI_WRITE_DATA | JI_WAIT_DATA, 0, LLONG_MAX);
 }
 
 int jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *jinode)
 {
-	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA);
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA, 0,
+			LLONG_MAX);
+}
+
+int jbd2_journal_inode_ranged_write(handle_t *handle,
+		struct jbd2_inode *jinode, loff_t start_byte, loff_t length)
+{
+	return jbd2_journal_file_inode(handle, jinode,
+			JI_WRITE_DATA | JI_WAIT_DATA, start_byte,
+			start_byte + length - 1);
+}
+
+int jbd2_journal_inode_ranged_wait(handle_t *handle, struct jbd2_inode *jinode,
+		loff_t start_byte, loff_t length)
+{
+	return jbd2_journal_file_inode(handle, jinode, JI_WAIT_DATA,
+			start_byte, start_byte + length - 1);
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 5c04181b7c6d8..0e0393e7f41a4 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -451,6 +451,22 @@ struct jbd2_inode {
 	 * @i_flags: Flags of inode [j_list_lock]
 	 */
 	unsigned long i_flags;
+
+	/**
+	 * @i_dirty_start:
+	 *
+	 * Offset in bytes where the dirty range for this inode starts.
+	 * [j_list_lock]
+	 */
+	loff_t i_dirty_start;
+
+	/**
+	 * @i_dirty_end:
+	 *
+	 * Inclusive offset in bytes where the dirty range for this inode
+	 * ends. [j_list_lock]
+	 */
+	loff_t i_dirty_end;
 };
 
 struct jbd2_revoke_table_s;
@@ -1397,6 +1413,12 @@ extern int	   jbd2_journal_force_commit(journal_t *);
 extern int	   jbd2_journal_force_commit_nested(journal_t *);
 extern int	   jbd2_journal_inode_add_write(handle_t *handle, struct jbd2_inode *inode);
 extern int	   jbd2_journal_inode_add_wait(handle_t *handle, struct jbd2_inode *inode);
+extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
+			struct jbd2_inode *inode, loff_t start_byte,
+			loff_t length);
+extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
+			struct jbd2_inode *inode, loff_t start_byte,
+			loff_t length);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH 3/3] ext4: use jbd2_inode dirty range scoping
  2019-06-19 17:21 [PATCH 0/3] Add dirty range scoping to jbd2 Ross Zwisler
  2019-06-19 17:21 ` [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors() Ross Zwisler
  2019-06-19 17:21 ` [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping Ross Zwisler
@ 2019-06-19 17:21 ` Ross Zwisler
  2019-06-20 11:15   ` Jan Kara
  2 siblings, 1 reply; 9+ messages in thread
From: Ross Zwisler @ 2019-06-19 17:21 UTC (permalink / raw)
  To: linux-kernel
  Cc: Ross Zwisler, Theodore Ts'o, Alexander Viro, Andreas Dilger,
	Jan Kara, linux-ext4, linux-fsdevel, linux-mm, Fletcher Woodruff,
	Justin TerAvest

Use the newly introduced jbd2_inode dirty range scoping to prevent us
from waiting forever when trying to complete a journal transaction.

Signed-off-by: Ross Zwisler <zwisler@google.com>
---
 fs/ext4/ext4_jbd2.h   | 12 ++++++------
 fs/ext4/inode.c       | 13 ++++++++++---
 fs/ext4/move_extent.c |  3 ++-
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 75a5309f22315..ef8fcf7d0d3b3 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -361,20 +361,20 @@ static inline int ext4_journal_force_commit(journal_t *journal)
 }
 
 static inline int ext4_jbd2_inode_add_write(handle_t *handle,
-					    struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_write(handle,
-						    EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_write(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }
 
 static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
-					   struct inode *inode)
+		struct inode *inode, loff_t start_byte, loff_t length)
 {
 	if (ext4_handle_valid(handle))
-		return jbd2_journal_inode_add_wait(handle,
-						   EXT4_I(inode)->jinode);
+		return jbd2_journal_inode_ranged_wait(handle,
+				EXT4_I(inode)->jinode, start_byte, length);
 	return 0;
 }
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c7f77c6430085..27fec5c594459 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -731,10 +731,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
 		    !ext4_is_quota_file(inode) &&
 		    ext4_should_order_data(inode)) {
+			loff_t start_byte =
+				(loff_t)map->m_lblk << inode->i_blkbits;
+			loff_t length = (loff_t)map->m_len << inode->i_blkbits;
+
 			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
-				ret = ext4_jbd2_inode_add_wait(handle, inode);
+				ret = ext4_jbd2_inode_add_wait(handle, inode,
+						start_byte, length);
 			else
-				ret = ext4_jbd2_inode_add_write(handle, inode);
+				ret = ext4_jbd2_inode_add_write(handle, inode,
+						start_byte, length);
 			if (ret)
 				return ret;
 		}
@@ -4085,7 +4091,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 		err = 0;
 		mark_buffer_dirty(bh);
 		if (ext4_should_order_data(inode))
-			err = ext4_jbd2_inode_add_write(handle, inode);
+			err = ext4_jbd2_inode_add_write(handle, inode, from,
+					length);
 	}
 
 unlock:
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 1083a9f3f16a1..c7ded4e2adff5 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -390,7 +390,8 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
 
 	/* Even in case of data=writeback it is reasonable to pin
 	 * inode to transaction, to prevent unexpected data loss */
-	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
+	*err = ext4_jbd2_inode_add_write(handle, orig_inode,
+			(loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
 
 unlock_pages:
 	unlock_page(pagep[0]);
-- 
2.22.0.410.gd8fdbe21b5-goog


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors()
  2019-06-19 17:21 ` [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors() Ross Zwisler
@ 2019-06-20  9:25   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-20  9:25 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Ross Zwisler, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Jan Kara, linux-ext4, linux-fsdevel, linux-mm,
	Fletcher Woodruff, Justin TerAvest

On Wed 19-06-19 11:21:54, Ross Zwisler wrote:
> In the spirit of filemap_fdatawait_range() and
> filemap_fdatawait_keep_errors(), introduce
> filemap_fdatawait_range_keep_errors() which both takes a range upon
> which to wait and does not clear errors from the address space.
> 
> Signed-off-by: Ross Zwisler <zwisler@google.com>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c       | 22 ++++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d3..79fec8a8413f4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2712,6 +2712,8 @@ extern int filemap_flush(struct address_space *);
>  extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  				   loff_t lend);
> +extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
> +		loff_t start_byte, loff_t end_byte);
>  
>  static inline int filemap_fdatawait(struct address_space *mapping)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index df2006ba0cfa5..e87252ca0835a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -553,6 +553,28 @@ int filemap_fdatawait_range(struct address_space *mapping, loff_t start_byte,
>  }
>  EXPORT_SYMBOL(filemap_fdatawait_range);
>  
> +/**
> + * filemap_fdatawait_range_keep_errors - wait for writeback to complete
> + * @mapping:		address space structure to wait for
> + * @start_byte:		offset in bytes where the range starts
> + * @end_byte:		offset in bytes where the range ends (inclusive)
> + *
> + * Walk the list of under-writeback pages of the given address space in the
> + * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
> + * this function does not clear error status of the address space.
> + *
> + * Use this function if callers don't handle errors themselves.  Expected
> + * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
> + * fsfreeze(8)
> + */
> +int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
> +		loff_t start_byte, loff_t end_byte)
> +{
> +	__filemap_fdatawait_range(mapping, start_byte, end_byte);
> +	return filemap_check_and_keep_errors(mapping);
> +}
> +EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
> +
>  /**
>   * file_fdatawait_range - wait for writeback to complete
>   * @file:		file pointing to address space structure to wait for
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
  2019-06-19 17:21 ` [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping Ross Zwisler
@ 2019-06-20 11:04   ` Jan Kara
  2019-06-20 15:09     ` Ross Zwisler
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2019-06-20 11:04 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Ross Zwisler, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Jan Kara, linux-ext4, linux-fsdevel, linux-mm,
	Fletcher Woodruff, Justin TerAvest

On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> Currently both journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() operate on the entire address space
> of each of the inodes associated with a given journal entry.  The
> consequence of this is that if we have an inode where we are constantly
> appending dirty pages we can end up waiting for an indefinite amount of
> time in journal_finish_inode_data_buffers() while we wait for all the
> pages under writeback to be written out.
> 
> The easiest way to cause this type of workload is do just dd from
> /dev/zero to a file until it fills the entire filesystem.  This can
> cause journal_finish_inode_data_buffers() to wait for the duration of
> the entire dd operation.
> 
> We can improve this situation by scoping each of the inode dirty ranges
> associated with a given transaction.  We do this via the jbd2_inode
> structure so that the scoping is contained within jbd2 and so that it
> follows the lifetime and locking rules for that structure.
> 
> This allows us to limit the writeback & wait in
> journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() respectively to the dirty range for
> a given struct jdb2_inode, keeping us from waiting forever if the inode
> in question is still being appended to.
> 
> Signed-off-by: Ross Zwisler <zwisler@google.com>

The patch looks good to me. I was thinking whether we should not have
separate ranges for current and the next transaction but I guess it is not
worth it at least for now. So just one nit below. With that applied feel free
to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>  	/* For locking, see the comment in journal_submit_data_buffers() */
>  	spin_lock(&journal->j_list_lock);
>  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> +		loff_t dirty_start = jinode->i_dirty_start;
> +		loff_t dirty_end = jinode->i_dirty_end;
> +
>  		if (!(jinode->i_flags & JI_WAIT_DATA))
>  			continue;
>  		jinode->i_flags |= JI_COMMIT_RUNNING;
>  		spin_unlock(&journal->j_list_lock);
> -		err = filemap_fdatawait_keep_errors(
> -				jinode->i_vfs_inode->i_mapping);
> +		err = filemap_fdatawait_range_keep_errors(
> +				jinode->i_vfs_inode->i_mapping, dirty_start,
> +				dirty_end);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> +
> +		if (!jinode->i_next_transaction) {
> +			jinode->i_dirty_start = 0;
> +			jinode->i_dirty_end = 0;
> +		}

This would be more logical in the next loop that moves jinode into the next
transaction.

>  		jinode->i_flags &= ~JI_COMMIT_RUNNING;
>  		smp_mb();
>  		wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 3/3] ext4: use jbd2_inode dirty range scoping
  2019-06-19 17:21 ` [PATCH 3/3] ext4: use " Ross Zwisler
@ 2019-06-20 11:15   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2019-06-20 11:15 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: linux-kernel, Ross Zwisler, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Jan Kara, linux-ext4, linux-fsdevel, linux-mm,
	Fletcher Woodruff, Justin TerAvest

On Wed 19-06-19 11:21:56, Ross Zwisler wrote:
> Use the newly introduced jbd2_inode dirty range scoping to prevent us
> from waiting forever when trying to complete a journal transaction.
> 
> Signed-off-by: Ross Zwisler <zwisler@google.com>

Looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/ext4_jbd2.h   | 12 ++++++------
>  fs/ext4/inode.c       | 13 ++++++++++---
>  fs/ext4/move_extent.c |  3 ++-
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 75a5309f22315..ef8fcf7d0d3b3 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -361,20 +361,20 @@ static inline int ext4_journal_force_commit(journal_t *journal)
>  }
>  
>  static inline int ext4_jbd2_inode_add_write(handle_t *handle,
> -					    struct inode *inode)
> +		struct inode *inode, loff_t start_byte, loff_t length)
>  {
>  	if (ext4_handle_valid(handle))
> -		return jbd2_journal_inode_add_write(handle,
> -						    EXT4_I(inode)->jinode);
> +		return jbd2_journal_inode_ranged_write(handle,
> +				EXT4_I(inode)->jinode, start_byte, length);
>  	return 0;
>  }
>  
>  static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
> -					   struct inode *inode)
> +		struct inode *inode, loff_t start_byte, loff_t length)
>  {
>  	if (ext4_handle_valid(handle))
> -		return jbd2_journal_inode_add_wait(handle,
> -						   EXT4_I(inode)->jinode);
> +		return jbd2_journal_inode_ranged_wait(handle,
> +				EXT4_I(inode)->jinode, start_byte, length);
>  	return 0;
>  }
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c6430085..27fec5c594459 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -731,10 +731,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		    !(flags & EXT4_GET_BLOCKS_ZERO) &&
>  		    !ext4_is_quota_file(inode) &&
>  		    ext4_should_order_data(inode)) {
> +			loff_t start_byte =
> +				(loff_t)map->m_lblk << inode->i_blkbits;
> +			loff_t length = (loff_t)map->m_len << inode->i_blkbits;
> +
>  			if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
> -				ret = ext4_jbd2_inode_add_wait(handle, inode);
> +				ret = ext4_jbd2_inode_add_wait(handle, inode,
> +						start_byte, length);
>  			else
> -				ret = ext4_jbd2_inode_add_write(handle, inode);
> +				ret = ext4_jbd2_inode_add_write(handle, inode,
> +						start_byte, length);
>  			if (ret)
>  				return ret;
>  		}
> @@ -4085,7 +4091,8 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  		err = 0;
>  		mark_buffer_dirty(bh);
>  		if (ext4_should_order_data(inode))
> -			err = ext4_jbd2_inode_add_write(handle, inode);
> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
> +					length);
>  	}
>  
>  unlock:
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 1083a9f3f16a1..c7ded4e2adff5 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -390,7 +390,8 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
>  
>  	/* Even in case of data=writeback it is reasonable to pin
>  	 * inode to transaction, to prevent unexpected data loss */
> -	*err = ext4_jbd2_inode_add_write(handle, orig_inode);
> +	*err = ext4_jbd2_inode_add_write(handle, orig_inode,
> +			(loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
>  
>  unlock_pages:
>  	unlock_page(pagep[0]);
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
  2019-06-20 11:04   ` Jan Kara
@ 2019-06-20 15:09     ` Ross Zwisler
  2019-06-20 17:22       ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Ross Zwisler @ 2019-06-20 15:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Ross Zwisler, linux-kernel, Theodore Ts'o, Alexander Viro,
	Andreas Dilger, Jan Kara, linux-ext4, linux-fsdevel, linux-mm,
	Fletcher Woodruff, Justin TerAvest

On Thu, Jun 20, 2019 at 01:04:54PM +0200, Jan Kara wrote:
> On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> > Currently both journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() operate on the entire address space
> > of each of the inodes associated with a given journal entry.  The
> > consequence of this is that if we have an inode where we are constantly
> > appending dirty pages we can end up waiting for an indefinite amount of
> > time in journal_finish_inode_data_buffers() while we wait for all the
> > pages under writeback to be written out.
> > 
> > The easiest way to cause this type of workload is do just dd from
> > /dev/zero to a file until it fills the entire filesystem.  This can
> > cause journal_finish_inode_data_buffers() to wait for the duration of
> > the entire dd operation.
> > 
> > We can improve this situation by scoping each of the inode dirty ranges
> > associated with a given transaction.  We do this via the jbd2_inode
> > structure so that the scoping is contained within jbd2 and so that it
> > follows the lifetime and locking rules for that structure.
> > 
> > This allows us to limit the writeback & wait in
> > journal_submit_inode_data_buffers() and
> > journal_finish_inode_data_buffers() respectively to the dirty range for
> > a given struct jdb2_inode, keeping us from waiting forever if the inode
> > in question is still being appended to.
> > 
> > Signed-off-by: Ross Zwisler <zwisler@google.com>
> 
> The patch looks good to me. I was thinking whether we should not have
> separate ranges for current and the next transaction but I guess it is not
> worth it at least for now. So just one nit below. With that applied feel free
> to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

We could definitely keep separate dirty ranges for each of the current and
next transaction.  I think the case where you would see a difference would be
if you had multiple transactions in a row which grew the dirty range for a
given jbd2_inode, and then had a random I/O workload which kept dirtying pages
inside that enlarged dirty range.

I'm not sure how often this type of workload would be a problem.  For the
workloads I've been testing which purely append to the inode, having a single
dirty range per jbd2_inode is sufficient.

I guess for now this single range seems simpler, but if later we find that
someone would benefit from separate tracking for each of the current and next
transactions, I'll take a shot at adding it.

Thank you for the review!

> > @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >  	/* For locking, see the comment in journal_submit_data_buffers() */
> >  	spin_lock(&journal->j_list_lock);
> >  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> > +		loff_t dirty_start = jinode->i_dirty_start;
> > +		loff_t dirty_end = jinode->i_dirty_end;
> > +
> >  		if (!(jinode->i_flags & JI_WAIT_DATA))
> >  			continue;
> >  		jinode->i_flags |= JI_COMMIT_RUNNING;
> >  		spin_unlock(&journal->j_list_lock);
> > -		err = filemap_fdatawait_keep_errors(
> > -				jinode->i_vfs_inode->i_mapping);
> > +		err = filemap_fdatawait_range_keep_errors(
> > +				jinode->i_vfs_inode->i_mapping, dirty_start,
> > +				dirty_end);
> >  		if (!ret)
> >  			ret = err;
> >  		spin_lock(&journal->j_list_lock);
> > +
> > +		if (!jinode->i_next_transaction) {
> > +			jinode->i_dirty_start = 0;
> > +			jinode->i_dirty_end = 0;
> > +		}
> 
> This would be more logical in the next loop that moves jinode into the next
> transaction.

Yep, agreed, this is much better.  Fixed in v2.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping
  2019-06-20 15:09     ` Ross Zwisler
@ 2019-06-20 17:22       ` Theodore Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Ts'o @ 2019-06-20 17:22 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, Ross Zwisler, linux-kernel, Alexander Viro,
	Andreas Dilger, Jan Kara, linux-ext4, linux-fsdevel, linux-mm,
	Fletcher Woodruff, Justin TerAvest

On Thu, Jun 20, 2019 at 09:09:11AM -0600, Ross Zwisler wrote:
> We could definitely keep separate dirty ranges for each of the current and
> next transaction.  I think the case where you would see a difference would be
> if you had multiple transactions in a row which grew the dirty range for a
> given jbd2_inode, and then had a random I/O workload which kept dirtying pages
> inside that enlarged dirty range.
> 
> I'm not sure how often this type of workload would be a problem.  For the
> workloads I've been testing which purely append to the inode, having a single
> dirty range per jbd2_inode is sufficient.

My inclination would be to keep things simple for now, unless we have
a real workload that tickles this.  In the long run I'm hoping to
remove the need to do writebacks from the journal thread altogether,
by always updating the metadata blocks *after* the I/O completes,
instead of before we submit the I/O.

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2019-06-20 17:22 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-19 17:21 [PATCH 0/3] Add dirty range scoping to jbd2 Ross Zwisler
2019-06-19 17:21 ` [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors() Ross Zwisler
2019-06-20  9:25   ` Jan Kara
2019-06-19 17:21 ` [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping Ross Zwisler
2019-06-20 11:04   ` Jan Kara
2019-06-20 15:09     ` Ross Zwisler
2019-06-20 17:22       ` Theodore Ts'o
2019-06-19 17:21 ` [PATCH 3/3] ext4: use " Ross Zwisler
2019-06-20 11:15   ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).