All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
@ 2020-09-28 19:40 Mauricio Faria de Oliveira
  2020-09-28 19:41 ` [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-28 19:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier

Hey Jan,

This series implements your suggestions for the RFC PATCH v3 set [1].

That addressed the issue you confirmed with block_page_mkwrite() [2].
There's no "JBD2: Spotted dirty metadata buffer" message in over 72h
runs across 3 VMs (it used to happen within a few hours.) *Thanks!*

I added Reviewed-by: tags for the patches/changes you mentioned.
The only changes from v3 are patch 3 which is new, and contains
the 2 fixes to ext4_page_mkwrite(); and patch 4 changed 'struct
writeback_control.nr_to_write' from ~0ULL to LONG_MAX, since it
is signed long (~0ULL overflows to -1; kudos, kernel test robot!)

It looks almost good on fstests: zero regressions on data=ordered,
and two apparently flaky tests data=journal (generic/347 _passed_
1/6 times with the patch, and generic/547 failed 1/6 times.)

I'll have more fstests runs on original/patched kernel to confirm
if they are flaky on both, or hit corner cases with patches only,
and will follow up with results. (and basic testing w/ ocfs2 too.)

Thanks again!
Mauricio

[1] https://lore.kernel.org/linux-ext4/20200910193127.276214-1-mfo@canonical.com/
[2] https://lore.kernel.org/linux-ext4/20200916161538.GK3607@quack2.suse.cz/

Mauricio Faria de Oliveira (4):
  jbd2: introduce/export functions
    jbd2_journal_submit|finish_inode_data_buffers()
  jbd2, ext4, ocfs2: introduce/use journal callbacks
    j_submit|finish_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  ext4: data=journal: write-protect pages on
    j_submit_inode_data_buffers()

 fs/ext4/inode.c      | 62 ++++++++++++++++++++++++++++-----
 fs/ext4/super.c      | 82 ++++++++++++++++++++++++++++++++++++++++++++
 fs/jbd2/commit.c     | 58 +++++++++++++++++--------------
 fs/jbd2/journal.c    |  2 ++
 fs/ocfs2/super.c     |  5 +++
 include/linux/jbd2.h | 29 +++++++++++++++-
 6 files changed, 203 insertions(+), 35 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
@ 2020-09-28 19:41 ` Mauricio Faria de Oliveira
  2020-09-29  2:24   ` Andreas Dilger
  2020-09-28 19:41 ` [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-28 19:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier

Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().

No functional change.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c     | 32 +++++++++++++++++---------------
 fs/jbd2/journal.c    |  2 ++
 include/linux/jbd2.h |  4 ++++
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..c17cda96926e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,9 +187,11 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with delayed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping,
-		loff_t dirty_start, loff_t dirty_end)
+int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
@@ -215,16 +217,11 @@ static int journal_submit_data_buffers(journal_t *journal,
 {
 	struct jbd2_inode *jinode;
 	int err, ret = 0;
-	struct address_space *mapping;
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
-		mapping = jinode->i_vfs_inode->i_mapping;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
 		/*
@@ -234,8 +231,7 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_submit_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -248,6 +244,17 @@ static int journal_submit_data_buffers(journal_t *journal,
 	return ret;
 }
 
+int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
+	int ret;
+
+	ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
+	return ret;
+}
+
 /*
  * Wait for data submitted for writeout, refile inodes to proper
  * transaction if needed.
@@ -262,16 +269,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_range_keep_errors(
-				jinode->i_vfs_inode->i_mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_finish_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..c0600405e7a2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
+EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
+EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 08f904943ab2..2865a5475888 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1421,6 +1421,10 @@ extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
 extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
 			struct jbd2_inode *inode, loff_t start_byte,
 			loff_t length);
+extern int	   jbd2_journal_submit_inode_data_buffers(
+			struct jbd2_inode *jinode);
+extern int	   jbd2_journal_finish_inode_data_buffers(
+			struct jbd2_inode *jinode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-09-28 19:41 ` [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-28 19:41 ` Mauricio Faria de Oliveira
  2020-09-29  2:28   ` Andreas Dilger
  2020-09-28 19:41 ` [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-28 19:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier

Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().

The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.

Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.

Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c      |  4 ++++
 fs/jbd2/commit.c     | 30 ++++++++++++++++++------------
 fs/ocfs2/super.c     |  5 +++++
 include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
 4 files changed, 51 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ea425b49b345..a14c1ed39aa3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4646,6 +4646,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	sbi->s_journal->j_submit_inode_data_buffers =
+		jbd2_journal_submit_inode_data_buffers;
+	sbi->s_journal->j_finish_inode_data_buffers =
+		jbd2_journal_finish_inode_data_buffers;
 
 no_journal:
 	if (!test_opt(sb, NO_MBCACHE)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c17cda96926e..23d3fcc11b97 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -200,6 +200,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 		.range_end = dirty_end,
 	};
 
+	/*
+	 * submit the inode data buffers. We use writepage
+	 * instead of writepages. Because writepages can do
+	 * block allocation with delalloc. We need to write
+	 * only allocated blocks here.
+	 */
 	ret = generic_writepages(mapping, &wbc);
 	return ret;
 }
@@ -224,16 +230,13 @@ static int journal_submit_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		/*
-		 * submit the inode data buffers. We use writepage
-		 * instead of writepages. Because writepages can do
-		 * block allocation  with delalloc. We need to write
-		 * only allocated blocks here.
-		 */
+		/* submit the inode data buffers. */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = jbd2_journal_submit_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		if (journal->j_submit_inode_data_buffers) {
+			err = journal->j_submit_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		J_ASSERT(jinode->i_transaction == commit_transaction);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
@@ -273,9 +276,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = jbd2_journal_finish_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		/* wait for the inode data buffers writeout. */
+		if (journal->j_finish_inode_data_buffers) {
+			err = journal->j_finish_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 1d91dd1e8711..560f13d4e2aa 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2211,6 +2211,11 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	}
 	osb->journal = journal;
 	journal->j_osb = osb;
+	journal->j_journal->j_submit_inode_data_buffers =
+		jbd2_journal_submit_inode_data_buffers;
+	journal->j_journal->j_finish_inode_data_buffers =
+		jbd2_journal_finish_inode_data_buffers;
+
 
 	atomic_set(&journal->j_num_trans, 0);
 	init_rwsem(&journal->j_trans_barrier);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2865a5475888..4aaa408c0ca7 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,9 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * List of inodes whose data we've modified in data=ordered mode.
+	 * List of inodes associated with the transaction; e.g., ext4 uses
+	 * this to track inodes in data=ordered and data=journal mode that
+	 * need special handling on transaction commit; also used by ocfs2.
 	 * [j_list_lock]
 	 */
 	struct list_head	t_inode_list;
@@ -1111,6 +1113,27 @@ struct journal_s
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);
 
+	/**
+	 * @j_submit_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WRITE_DATA flag
+	 * before we start to write out the transaction to the journal.
+	 */
+	int			(*j_submit_inode_data_buffers)
+					(struct jbd2_inode *);
+
+	/**
+	 * @j_finish_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WAIT_DATA flag
+	 * after we have written the transaction to the journal
+	 * but before we write out the commit block.
+	 */
+	int			(*j_finish_inode_data_buffers)
+					(struct jbd2_inode *);
+
 	/*
 	 * Journal statistics
 	 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
  2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-09-28 19:41 ` [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
  2020-09-28 19:41 ` [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-28 19:41 ` Mauricio Faria de Oliveira
  2020-09-29  2:34   ` Andreas Dilger
  2020-09-29 11:48   ` Jan Kara
  2020-09-28 19:41 ` [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
  2020-09-29 11:37 ` [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  4 siblings, 2 replies; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-28 19:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier

These are two fixes for data journalling required by
the next patch, discovered while testing it.

First, the optimization to return early if all buffers
are mapped is not appropriate for the next patch:

The inode _must_ be added to the transaction's list in
data=journal mode (so to write-protect pages on commit)
thus we cannot return early there.

Second, once that optimization to reduce transactions
was disabled for data=journal mode, more transactions
happened, and occasionally hit this warning message:
'JBD2: Spotted dirty metadata buffer'.

Reason is, block_page_mkwrite() will set_buffer_dirty()
before do_journal_get_write_access() that is there to
prevent it. This issue was masked by the optimization.

So, on data=journal use __block_write_begin() instead.
This also requires page locking and len recalculation.
(see block_page_mkwrite() for implementation details.)

Finally, as Jan noted there is little sharing between
data=journal and other modes in ext4_page_mkwrite().

However, a prototype of ext4_journalled_page_mkwrite()
showed there still would be lots of duplicated lines
(tens of) that didn't seem worth it.

Thus this patch ends up with an ugly goto to skip all
non-data journalling code (to avoid long indentations,
but that can be changed..) in the beginning, and just
a conditional in the transaction section.

Well, we skip a common part to data journalling which
is the page truncated check, but we do it again after
ext4_journal_start() when we re-acquire the page lock
(so not to acquire the page lock twice needlessly for
data journalling.)

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 51 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..ac153e340a6f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5977,9 +5977,17 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	if (err)
 		goto out_ret;
 
+	/*
+	 * On data journalling we skip straight to the transaction handle:
+	 * there's no delalloc; page truncated will be checked later; the
+	 * early return w/ all buffers mapped (calculates size/len) can't
+	 * be used; and there's no dioread_nolock, so only ext4_get_block.
+	 */
+	if (ext4_should_journal_data(inode))
+		goto retry_alloc;
+
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
-	    !ext4_should_journal_data(inode) &&
 	    !ext4_nonda_switch(inode->i_sb)) {
 		do {
 			err = block_page_mkwrite(vma, vmf,
@@ -6005,6 +6013,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	/*
 	 * Return if we have all the buffers mapped. This avoids the need to do
 	 * journal_start/journal_stop which can block and take a long time
+	 *
+	 * This cannot be done for data journalling, as we have to add the
+	 * inode to the transaction's list to writeprotect pages on commit.
 	 */
 	if (page_has_buffers(page)) {
 		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
@@ -6029,16 +6040,42 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
-	err = block_page_mkwrite(vma, vmf, get_block);
-	if (!err && ext4_should_journal_data(inode)) {
-		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
-			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
+	/*
+	 * Data journalling can't use block_page_mkwrite() because it
+	 * will set_buffer_dirty() before do_journal_get_write_access()
+	 * thus might hit warning messages for dirty metadata buffers.
+	 */
+	if (!ext4_should_journal_data(inode)) {
+		err = block_page_mkwrite(vma, vmf, get_block);
+	} else {
+		lock_page(page);
+		size = i_size_read(inode);
+		/* Page got truncated from under us? */
+		if (page->mapping != mapping || page_offset(page) > size) {
 			unlock_page(page);
-			ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_NOPAGE;
 			ext4_journal_stop(handle);
 			goto out;
 		}
-		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+
+		if (page->index == size >> PAGE_SHIFT)
+			len = size & ~PAGE_MASK;
+		else
+			len = PAGE_SIZE;
+
+		err = __block_write_begin(page, 0, len, ext4_get_block);
+		if (!err) {
+			if (ext4_walk_page_buffers(handle, page_buffers(page),
+					0, len, NULL, do_journal_get_write_access)) {
+				unlock_page(page);
+				ret = VM_FAULT_SIGBUS;
+				ext4_journal_stop(handle);
+				goto out;
+			}
+			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+		} else {
+			unlock_page(page);
+		}
 	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2020-09-28 19:41 ` [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
@ 2020-09-28 19:41 ` Mauricio Faria de Oliveira
  2020-09-29 12:10   ` Jan Kara
  2020-09-29 11:37 ` [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  4 siblings, 1 reply; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-28 19:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier

This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.

If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.

    [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
    [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
    [   27.342716] JBD2: recovery failed
    [   27.343316] EXT4-fs (loop0): error loading journal
    mount: /ext4: can't read superblock on /dev/loop0.

In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.

In j_finish_inode_data_buffers() there is nothing do to.

And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().

In ext4_page_mkwrite() we must make sure that the buffers are attached
to the transaction as jbddirty with write_end_fn(), as already done in
__ext4_journalled_writepage().

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 25 +++++++++------
 fs/ext4/super.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 96 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ac153e340a6f..af5de62c1214 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
 		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
 					     write_end_fn);
 	}
+	if (ret == 0)
+		ret = err;
+	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
 	if (ret == 0)
 		ret = err;
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -6052,10 +6055,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		size = i_size_read(inode);
 		/* Page got truncated from under us? */
 		if (page->mapping != mapping || page_offset(page) > size) {
-			unlock_page(page);
 			ret = VM_FAULT_NOPAGE;
-			ext4_journal_stop(handle);
-			goto out;
+			goto out_error;
 		}
 
 		if (page->index == size >> PAGE_SHIFT)
@@ -6065,13 +6066,15 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 
 		err = __block_write_begin(page, 0, len, ext4_get_block);
 		if (!err) {
+			ret = VM_FAULT_SIGBUS;
 			if (ext4_walk_page_buffers(handle, page_buffers(page),
-					0, len, NULL, do_journal_get_write_access)) {
-				unlock_page(page);
-				ret = VM_FAULT_SIGBUS;
-				ext4_journal_stop(handle);
-				goto out;
-			}
+					0, len, NULL, do_journal_get_write_access))
+				goto out_error;
+			if (ext4_walk_page_buffers(handle, page_buffers(page),
+					0, len, NULL, write_end_fn))
+				goto out_error;
+			if (ext4_jbd2_inode_add_write(handle, inode, 0, len))
+				goto out_error;
 			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 		} else {
 			unlock_page(page);
@@ -6086,6 +6089,10 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
+out_error:
+	unlock_page(page);
+	ext4_journal_stop(handle);
+	goto out;
 }
 
 vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a14c1ed39aa3..ac9558080fc7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,84 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	spin_unlock(&sbi->s_md_lock);
 }
 
+/*
+ * This writepage callback for write_cache_pages()
+ * takes care of a few cases after page cleaning.
+ *
+ * write_cache_pages() already checks for dirty pages
+ * and calls clear_page_dirty_for_io(), which we want,
+ * to write protect the pages.
+ *
+ * However, we have to redirty a page in these cases:
+ * 1) some buffer is dirty (needs checkpointing)
+ * 2) some buffer is not part of the committing transaction
+ * 3) some buffer already has b_next_transaction set
+ */
+
+static int ext4_journalled_writepage_callback(struct page *page,
+					      struct writeback_control *wbc,
+					      void *data)
+{
+	transaction_t *transaction = (transaction_t *) data;
+	struct buffer_head *bh, *head;
+	struct journal_head *jh;
+
+	bh = head = page_buffers(page);
+	do {
+		jh = bh2jh(bh);
+		if (buffer_dirty(bh) ||
+			(jh && (jh->b_transaction != transaction ||
+				jh->b_next_transaction))) {
+			redirty_page_for_writepage(wbc, page);
+			goto out;
+		}
+	} while ((bh = bh->b_this_page) != head);
+
+out:
+	return AOP_WRITEPAGE_ACTIVATE;
+}
+
+static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	transaction_t *transaction = jinode->i_transaction;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
+
+	struct writeback_control wbc = {
+		.sync_mode =  WB_SYNC_ALL,
+		.nr_to_write = LONG_MAX,
+		.range_start = dirty_start,
+		.range_end = dirty_end,
+        };
+
+	return write_cache_pages(mapping, &wbc,
+				 ext4_journalled_writepage_callback,
+				 transaction);
+}
+
+static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	int ret;
+
+	if (ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = ext4_journalled_submit_inode_data_buffers(jinode);
+	else
+		ret = jbd2_journal_submit_inode_data_buffers(jinode);
+
+	return ret;
+}
+
+static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	int ret = 0;
+
+	if (!ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = jbd2_journal_finish_inode_data_buffers(jinode);
+
+	return ret;
+}
+
 static bool system_going_down(void)
 {
 	return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4647,9 +4725,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
 	sbi->s_journal->j_submit_inode_data_buffers =
-		jbd2_journal_submit_inode_data_buffers;
+		ext4_journal_submit_inode_data_buffers;
 	sbi->s_journal->j_finish_inode_data_buffers =
-		jbd2_journal_finish_inode_data_buffers;
+		ext4_journal_finish_inode_data_buffers;
 
 no_journal:
 	if (!test_opt(sb, NO_MBCACHE)) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-09-28 19:41 ` [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-29  2:24   ` Andreas Dilger
  2020-09-30 21:36     ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 16+ messages in thread
From: Andreas Dilger @ 2020-09-29  2:24 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

[-- Attachment #1: Type: text/plain, Size: 1561 bytes --]

On Sep 28, 2020, at 1:41 PM, Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> 
> Export functions that implement the current behavior done
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> No functional change.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>

A couple of minor cleanups below, but either way you could add:

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> +int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = jinode->i_dirty_start;
> +	loff_t dirty_end = jinode->i_dirty_end;
> +	int ret;
> +
> +	ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
> +	return ret;
> +}

(style) still prefer to wrap at 80 columns if possible.
(style) there isn't any benefit to "dirty_start" and "dirty_end" as locals
(style) there also isn't any benefit to "ret = ...; return ret"

I thought it might be coded this way because the function is changed in a
later patch in the series, but I couldn't find anything like that, so the
shorter form is just as readable, IMHO:

int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
{
	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;

	return filemap_fdatawait_range_keep_errors(mapping,
						   jinode->dirty_start,
						   jinode->dirty_end);
}

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-09-28 19:41 ` [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-29  2:28   ` Andreas Dilger
  0 siblings, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2020-09-29  2:28 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

[-- Attachment #1: Type: text/plain, Size: 987 bytes --]

On Sep 28, 2020, at 1:41 PM, Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> 
> Introduce journal callbacks to allow different behaviors
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> The existing users of the current behavior (ext4, ocfs2)
> are adapted to use the previously exported functions
> that implement the current behavior.
> 
> Users are callers of jbd2_journal_inode_ranged_write|wait(),
> which adds the inode to the transaction's inode list with
> the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
> 
> Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
> which builds fs/jbd2/commit.c and journal.c that define and
> export the functions, so we can call directly in ext4/ocfs2.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
  2020-09-28 19:41 ` [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
@ 2020-09-29  2:34   ` Andreas Dilger
  2020-09-29 11:48   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Andreas Dilger @ 2020-09-29  2:34 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

[-- Attachment #1: Type: text/plain, Size: 2108 bytes --]

On Sep 28, 2020, at 1:41 PM, Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> 
> These are two fixes for data journalling required by
> the next patch, discovered while testing it.
> 
> First, the optimization to return early if all buffers
> are mapped is not appropriate for the next patch:
> 
> The inode _must_ be added to the transaction's list in
> data=journal mode (so to write-protect pages on commit)
> thus we cannot return early there.
> 
> Second, once that optimization to reduce transactions
> was disabled for data=journal mode, more transactions
> happened, and occasionally hit this warning message:
> 'JBD2: Spotted dirty metadata buffer'.
> 
> Reason is, block_page_mkwrite() will set_buffer_dirty()
> before do_journal_get_write_access() that is there to
> prevent it. This issue was masked by the optimization.
> 
> So, on data=journal use __block_write_begin() instead.
> This also requires page locking and len recalculation.
> (see block_page_mkwrite() for implementation details.)
> 
> Finally, as Jan noted there is little sharing between
> data=journal and other modes in ext4_page_mkwrite().
> 
> However, a prototype of ext4_journalled_page_mkwrite()
> showed there still would be lots of duplicated lines
> (tens of) that didn't seem worth it.
> 
> Thus this patch ends up with an ugly goto to skip all
> non-data journalling code (to avoid long indentations,
> but that can be changed..) in the beginning, and just
> a conditional in the transaction section.
> 
> Well, we skip a common part to data journalling which
> is the page truncated check, but we do it again after
> ext4_journal_start() when we re-acquire the page lock
> (so not to acquire the page lock twice needlessly for
> data journalling.)
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>

Thanks for the clear commit message.  It definitely would not
be clear why the patch was structured in this way without it.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>


Cheers, Andreas






[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (3 preceding siblings ...)
  2020-09-28 19:41 ` [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-29 11:37 ` Jan Kara
  2020-09-30 22:59   ` Mauricio Faria de Oliveira
  4 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-09-29 11:37 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

On Mon 28-09-20 16:40:59, Mauricio Faria de Oliveira wrote:
> Hey Jan,
> 
> This series implements your suggestions for the RFC PATCH v3 set [1].
> 
> That addressed the issue you confirmed with block_page_mkwrite() [2].
> There's no "JBD2: Spotted dirty metadata buffer" message in over 72h
> runs across 3 VMs (it used to happen within a few hours.) *Thanks!*
> 
> I added Reviewed-by: tags for the patches/changes you mentioned.
> The only changes from v3 are patch 3 which is new, and contains
> the 2 fixes to ext4_page_mkwrite(); and patch 4 changed 'struct
> writeback_control.nr_to_write' from ~0ULL to LONG_MAX, since it
> is signed long (~0ULL overflows to -1; kudos, kernel test robot!)
> 
> It looks almost good on fstests: zero regressions on data=ordered,
> and two apparently flaky tests data=journal (generic/347 _passed_
> 1/6 times with the patch, and generic/547 failed 1/6 times.)

Cool. Neither of these tests has anything to do with mmap. The first test
checks what happens when thin provisioned storage runs out of space (and
that fs remains consistent), the second tests that fsync flushed properly
all data and that it can be seen after a crash. So I'm reasonably confident
that it isn't caused by your patches. It still might be a bug in
data=journal implementation though but that would be something for another
patch series :).

I'll have a look at the remaining patches.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
  2020-09-28 19:41 ` [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
  2020-09-29  2:34   ` Andreas Dilger
@ 2020-09-29 11:48   ` Jan Kara
  1 sibling, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-09-29 11:48 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

On Mon 28-09-20 16:41:02, Mauricio Faria de Oliveira wrote:
> These are two fixes for data journalling required by
> the next patch, discovered while testing it.
> 
> First, the optimization to return early if all buffers
> are mapped is not appropriate for the next patch:
> 
> The inode _must_ be added to the transaction's list in
> data=journal mode (so to write-protect pages on commit)
> thus we cannot return early there.
> 
> Second, once that optimization to reduce transactions
> was disabled for data=journal mode, more transactions
> happened, and occasionally hit this warning message:
> 'JBD2: Spotted dirty metadata buffer'.
> 
> Reason is, block_page_mkwrite() will set_buffer_dirty()
> before do_journal_get_write_access() that is there to
> prevent it. This issue was masked by the optimization.
> 
> So, on data=journal use __block_write_begin() instead.
> This also requires page locking and len recalculation.
> (see block_page_mkwrite() for implementation details.)
> 
> Finally, as Jan noted there is little sharing between
> data=journal and other modes in ext4_page_mkwrite().
> 
> However, a prototype of ext4_journalled_page_mkwrite()
> showed there still would be lots of duplicated lines
> (tens of) that didn't seem worth it.
> 
> Thus this patch ends up with an ugly goto to skip all
> non-data journalling code (to avoid long indentations,
> but that can be changed..) in the beginning, and just
> a conditional in the transaction section.
> 
> Well, we skip a common part to data journalling which
> is the page truncated check, but we do it again after
> ext4_journal_start() when we re-acquire the page lock
> (so not to acquire the page lock twice needlessly for
> data journalling.)
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

							Honza

> ---
>  fs/ext4/inode.c | 51 ++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 44 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf596467c234..ac153e340a6f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5977,9 +5977,17 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	if (err)
>  		goto out_ret;
>  
> +	/*
> +	 * On data journalling we skip straight to the transaction handle:
> +	 * there's no delalloc; page truncated will be checked later; the
> +	 * early return w/ all buffers mapped (calculates size/len) can't
> +	 * be used; and there's no dioread_nolock, so only ext4_get_block.
> +	 */
> +	if (ext4_should_journal_data(inode))
> +		goto retry_alloc;
> +
>  	/* Delalloc case is easy... */
>  	if (test_opt(inode->i_sb, DELALLOC) &&
> -	    !ext4_should_journal_data(inode) &&
>  	    !ext4_nonda_switch(inode->i_sb)) {
>  		do {
>  			err = block_page_mkwrite(vma, vmf,
> @@ -6005,6 +6013,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	/*
>  	 * Return if we have all the buffers mapped. This avoids the need to do
>  	 * journal_start/journal_stop which can block and take a long time
> +	 *
> +	 * This cannot be done for data journalling, as we have to add the
> +	 * inode to the transaction's list to writeprotect pages on commit.
>  	 */
>  	if (page_has_buffers(page)) {
>  		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
> @@ -6029,16 +6040,42 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  		ret = VM_FAULT_SIGBUS;
>  		goto out;
>  	}
> -	err = block_page_mkwrite(vma, vmf, get_block);
> -	if (!err && ext4_should_journal_data(inode)) {
> -		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> -			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
> +	/*
> +	 * Data journalling can't use block_page_mkwrite() because it
> +	 * will set_buffer_dirty() before do_journal_get_write_access()
> +	 * thus might hit warning messages for dirty metadata buffers.
> +	 */
> +	if (!ext4_should_journal_data(inode)) {
> +		err = block_page_mkwrite(vma, vmf, get_block);
> +	} else {
> +		lock_page(page);
> +		size = i_size_read(inode);
> +		/* Page got truncated from under us? */
> +		if (page->mapping != mapping || page_offset(page) > size) {
>  			unlock_page(page);
> -			ret = VM_FAULT_SIGBUS;
> +			ret = VM_FAULT_NOPAGE;
>  			ext4_journal_stop(handle);
>  			goto out;
>  		}
> -		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> +
> +		if (page->index == size >> PAGE_SHIFT)
> +			len = size & ~PAGE_MASK;
> +		else
> +			len = PAGE_SIZE;
> +
> +		err = __block_write_begin(page, 0, len, ext4_get_block);
> +		if (!err) {
> +			if (ext4_walk_page_buffers(handle, page_buffers(page),
> +					0, len, NULL, do_journal_get_write_access)) {
> +				unlock_page(page);
> +				ret = VM_FAULT_SIGBUS;
> +				ext4_journal_stop(handle);
> +				goto out;
> +			}
> +			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> +		} else {
> +			unlock_page(page);
> +		}
>  	}
>  	ext4_journal_stop(handle);
>  	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-09-28 19:41 ` [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-29 12:10   ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-09-29 12:10 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier

On Mon 28-09-20 16:41:03, Mauricio Faria de Oliveira wrote:
> This implements journal callbacks j_submit|finish_inode_data_buffers()
> with different behavior for data=journal: to write-protect pages under
> commit, preventing changes to buffers writeably mapped to userspace.
> 
> If a buffer's content changes between commit's checksum calculation
> and write-out to disk, it can cause journal recovery/mount failures
> upon a kernel crash or power loss.
> 
>     [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
>     [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
>     [   27.342716] JBD2: recovery failed
>     [   27.343316] EXT4-fs (loop0): error loading journal
>     mount: /ext4: can't read superblock on /dev/loop0.
> 
> In j_submit_inode_data_buffers() we write-protect the inode's pages
> with write_cache_pages() and redirty w/ writepage callback if needed.
> 
> In j_finish_inode_data_buffers() there is nothing do to.
> 
> And in order to use the callbacks, inodes are added to the inode list
> in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
> 
> In ext4_page_mkwrite() we must make sure that the buffers are attached
> to the transaction as jbddirty with write_end_fn(), as already done in
> __ext4_journalled_writepage().
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Reported-by: Dann Frazier <dann.frazier@canonical.com>
> Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
> Suggested-by: Jan Kara <jack@suse.cz>

The patch looks good to me. Just one nit below. After fixing that feel free
to add:

Reviewed-by: Jan Kara <jack@suse.cz>

> + * However, we have to redirty a page in these cases:
> + * 1) some buffer is dirty (needs checkpointing)
> + * 2) some buffer is not part of the committing transaction
> + * 3) some buffer already has b_next_transaction set
> + */

Maybe I'd move this comment inside ext4_journalled_writepage_callback()
just before the if () to make it clear what it speaks about. I'd also
somewhat expand it like:

/*
 * However, we have to redirty a page in these cases:
 * 1) If buffer is dirty, it means the page was dirty because it contains a
 * buffer that needs checkpointing. So dirty bit needs to be preserved so
 * that checkpointing writes the buffer properly.
 * 2) If buffer is not part of the committing transaction (we may have just
 * accidentally come across this buffer because inode range tracking is not
 * exact) or if the currently running transaction already contains this
 * buffer as well, dirty bit needs to be preserved so that the buffer gets
 * properly writeprotected on running transaction's commit.
 */

> +
> +static int ext4_journalled_writepage_callback(struct page *page,
> +					      struct writeback_control *wbc,
> +					      void *data)
> +{
> +	transaction_t *transaction = (transaction_t *) data;
> +	struct buffer_head *bh, *head;
> +	struct journal_head *jh;
> +
> +	bh = head = page_buffers(page);
> +	do {
> +		jh = bh2jh(bh);
> +		if (buffer_dirty(bh) ||
> +			(jh && (jh->b_transaction != transaction ||
> +				jh->b_next_transaction))) {

Also we usually indent the condition like:
		if (buffer_dirty(bh) ||
		    (jh && (jh->b_transaction != transaction ||
			    jh->b_next_transaction))) {

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-09-29  2:24   ` Andreas Dilger
@ 2020-09-30 21:36     ` Mauricio Faria de Oliveira
  0 siblings, 0 replies; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-30 21:36 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Jan Kara, linux-ext4, dann frazier

On Mon, Sep 28, 2020 at 11:24 PM Andreas Dilger <adilger@dilger.ca> wrote:
>
> On Sep 28, 2020, at 1:41 PM, Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
> >
> > Export functions that implement the current behavior done
> > for an inode in journal_submit|finish_inode_data_buffers().
> >
> > No functional change.
> >
> > Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> > Suggested-by: Jan Kara <jack@suse.cz>
> > Reviewed-by: Jan Kara <jack@suse.cz>
>
> A couple of minor cleanups below, but either way you could add:
>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>
>

Hey Andreas, thanks for reviewing!

These cleanups/style changes do look better -- applied to the two functions
in patch 1 (submit and finish), and another function in patch 4
(submit callback).

cheers,
Mauricio

> > +int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> > +{
> > +     struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> > +     loff_t dirty_start = jinode->i_dirty_start;
> > +     loff_t dirty_end = jinode->i_dirty_end;
> > +     int ret;
> > +
> > +     ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
> > +     return ret;
> > +}
>
> (style) still prefer to wrap at 80 columns if possible.
> (style) there isn't any benefit to "dirty_start" and "dirty_end" as locals
> (style) there also isn't any benefit to "ret = ...; return ret"
>
> I thought it might be coded this way because the function is changed in a
> later patch in the series, but I couldn't find anything like that, so the
> shorter form is just as readable, IMHO:
>
> int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> {
>         struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
>
>         return filemap_fdatawait_range_keep_errors(mapping,
>                                                    jinode->dirty_start,
>                                                    jinode->dirty_end);
> }
>
> Cheers, Andreas
>
>
>
>
>


-- 
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-09-29 11:37 ` [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
@ 2020-09-30 22:59   ` Mauricio Faria de Oliveira
  2020-10-01  7:34     ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-30 22:59 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger; +Cc: linux-ext4, dann frazier

On Tue, Sep 29, 2020 at 8:37 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 28-09-20 16:40:59, Mauricio Faria de Oliveira wrote:
> > Hey Jan,
> >
> > This series implements your suggestions for the RFC PATCH v3 set [1].
> >
> > That addressed the issue you confirmed with block_page_mkwrite() [2].
> > There's no "JBD2: Spotted dirty metadata buffer" message in over 72h
> > runs across 3 VMs (it used to happen within a few hours.) *Thanks!*
> >
> > I added Reviewed-by: tags for the patches/changes you mentioned.
> > The only changes from v3 are patch 3 which is new, and contains
> > the 2 fixes to ext4_page_mkwrite(); and patch 4 changed 'struct
> > writeback_control.nr_to_write' from ~0ULL to LONG_MAX, since it
> > is signed long (~0ULL overflows to -1; kudos, kernel test robot!)
> >
> > It looks almost good on fstests: zero regressions on data=ordered,
> > and two apparently flaky tests data=journal (generic/347 _passed_
> > 1/6 times with the patch, and generic/547 failed 1/6 times.)
>
> Cool. Neither of these tests has anything to do with mmap. The first test
> checks what happens when thin provisioned storage runs out of space (and
> that fs remains consistent), the second tests that fsync flushed properly
> all data and that it can be seen after a crash. So I'm reasonably confident
> that it isn't caused by your patches. It still might be a bug in
> data=journal implementation though but that would be something for another
> patch series :).
>

Hey Jan,

That's good to hear! Now that the patchset seems to be in good shape,
I worked on testing it these last 2 days. Good and mixed-feelings news. :-)

1) For ext4 first, I have put 2 VMs to run fstests in a loop overnight,
(original and patched kernels, ~20 runs each). It looks like the patched VM
has more variance of failed/flaky tests, but the "average failure set" holds.

I think some of the failures were flaky or timing related, because when I ran
some tests, e.g. generic/371 a hundred times (./check -i 100 generic/371)
then it only failed 6 out of 100 times. So I didn't look much more into it, but
should you feel like recommending a more careful look, I'll be happy to do it.

For documentation purposes, the results on v5.9-rc7 and next-20200930,
showing no "permanent" regressions. Good news :)

    data=ordered:
    Failures: ext4/045 generic/044 generic/045 generic/046 generic/051
generic/223 generic/388 generic/465 generic/475 generic/553
generic/554 generic/555 generic/565 generic/611

    data=journal:
    Failures: ext4/045 generic/051 generic/223 generic/347 generic/388
generic/441 generic/475 generic/553 generic/554 generic/555
generic/565 generic/611

2) For OCFS2, I just had to change where we set the callbacks in (patch 2.)
(I'll include that in the next, hopefully non-RFC patchset, with
Andreas suggestions.)

Then a local mount also has no regressions on "stress-ng --class filesystem,io".
Good news too :)  For reference, the steps:

    # mkfs.ocfs2 --mount local $DEV
    # mount $DEV $MNT
    # cd $MNT
    # stress-ng --sequential 0 --class filesystem,io

3) Now, the mixed-feelings news.

The synthetic test-case/patches I had written clearly show that the
patchset works:
- In the original kernel, userspace can write to buffers during
commit; and it moves on.
- In the patched kernel, userspace cannot write to buffers during
commit; it blocks.

However, the heavy-hammer testing with 'stress-ng --mmap 4xNCPUs --mmap-file'
then crashing the kernel via sysrq-trigger, and trying to mount the
filesystem again,
sometimes still can find invalid checksums, thus journal recovery/mount fails.

    [   98.194809] JBD2: Invalid checksum recovering data block 109704 in log
    [   98.201853] JBD2: Invalid checksum recovering data block 69959 in log
    [   98.339859] JBD2: recovery failed
    [   98.340581] EXT4-fs (vdc): error loading journal

So, despite the test exercising mmap() and the patchset being for mmap(),
apparently there is more happening that also needs changes. (Weird; but
I will try to debug that test-case behavior deeper, to find what's going on.)

This patchset does address a problem, so should we move on with this one,
and as you mentioned, "that would be something for another patch series :)" ?

Thank you,
Mauricio



> I'll have a look at the remaining patches.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



-- 
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-09-30 22:59   ` Mauricio Faria de Oliveira
@ 2020-10-01  7:34     ` Jan Kara
  2020-10-01 12:46       ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 16+ messages in thread
From: Jan Kara @ 2020-10-01  7:34 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, Andreas Dilger, linux-ext4, dann frazier

On Wed 30-09-20 19:59:44, Mauricio Faria de Oliveira wrote:
> On Tue, Sep 29, 2020 at 8:37 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 28-09-20 16:40:59, Mauricio Faria de Oliveira wrote:
> > > Hey Jan,
> > >
> > > This series implements your suggestions for the RFC PATCH v3 set [1].
> > >
> > > That addressed the issue you confirmed with block_page_mkwrite() [2].
> > > There's no "JBD2: Spotted dirty metadata buffer" message in over 72h
> > > runs across 3 VMs (it used to happen within a few hours.) *Thanks!*
> > >
> > > I added Reviewed-by: tags for the patches/changes you mentioned.
> > > The only changes from v3 are patch 3 which is new, and contains
> > > the 2 fixes to ext4_page_mkwrite(); and patch 4 changed 'struct
> > > writeback_control.nr_to_write' from ~0ULL to LONG_MAX, since it
> > > is signed long (~0ULL overflows to -1; kudos, kernel test robot!)
> > >
> > > It looks almost good on fstests: zero regressions on data=ordered,
> > > and two apparently flaky tests data=journal (generic/347 _passed_
> > > 1/6 times with the patch, and generic/547 failed 1/6 times.)
> >
> > Cool. Neither of these tests has anything to do with mmap. The first test
> > checks what happens when thin provisioned storage runs out of space (and
> > that fs remains consistent), the second tests that fsync flushed properly
> > all data and that it can be seen after a crash. So I'm reasonably confident
> > that it isn't caused by your patches. It still might be a bug in
> > data=journal implementation though but that would be something for another
> > patch series :).
> >
> 
> Hey Jan,
> 
> That's good to hear! Now that the patchset seems to be in good shape,
> I worked on testing it these last 2 days. Good and mixed-feelings news. :-)
> 
> 1) For ext4 first, I have put 2 VMs to run fstests in a loop overnight,
> (original and patched kernels, ~20 runs each). It looks like the patched VM
> has more variance of failed/flaky tests, but the "average failure set" holds.
> 
> I think some of the failures were flaky or timing related, because when I ran
> some tests, e.g. generic/371 a hundred times (./check -i 100 generic/371)
> then it only failed 6 out of 100 times. So I didn't look much more into it, but
> should you feel like recommending a more careful look, I'll be happy to do it.
> 
> For documentation purposes, the results on v5.9-rc7 and next-20200930,
> showing no "permanent" regressions. Good news :)
> 
>     data=ordered:
>     Failures: ext4/045 generic/044 generic/045 generic/046 generic/051
> generic/223 generic/388 generic/465 generic/475 generic/553
> generic/554 generic/555 generic/565 generic/611
> 
>     data=journal:
>     Failures: ext4/045 generic/051 generic/223 generic/347 generic/388
> generic/441 generic/475 generic/553 generic/554 generic/555
> generic/565 generic/611
> 
> 2) For OCFS2, I just had to change where we set the callbacks in (patch 2.)
> (I'll include that in the next, hopefully non-RFC patchset, with
> Andreas suggestions.)
> 
> Then a local mount also has no regressions on "stress-ng --class filesystem,io".
> Good news too :)  For reference, the steps:
> 
>     # mkfs.ocfs2 --mount local $DEV
>     # mount $DEV $MNT
>     # cd $MNT
>     # stress-ng --sequential 0 --class filesystem,io

OK, these look sane. Nothing that would particularly worry me wrt this
patch set although some of those errors (e.g. the flaky generic/371 test
or generic/441 failure) would warrant closer look.

> 3) Now, the mixed-feelings news.
> 
> The synthetic test-case/patches I had written clearly show that the
> patchset works:
> - In the original kernel, userspace can write to buffers during
> commit; and it moves on.
> - In the patched kernel, userspace cannot write to buffers during
> commit; it blocks.
> 
> However, the heavy-hammer testing with 'stress-ng --mmap 4xNCPUs --mmap-file'
> then crashing the kernel via sysrq-trigger, and trying to mount the
> filesystem again,
> sometimes still can find invalid checksums, thus journal recovery/mount fails.
> 
>     [   98.194809] JBD2: Invalid checksum recovering data block 109704 in log
>     [   98.201853] JBD2: Invalid checksum recovering data block 69959 in log
>     [   98.339859] JBD2: recovery failed
>     [   98.340581] EXT4-fs (vdc): error loading journal
> 
> So, despite the test exercising mmap() and the patchset being for mmap(),
> apparently there is more happening that also needs changes. (Weird; but
> I will try to debug that test-case behavior deeper, to find what's going on.)
> 
> This patchset does address a problem, so should we move on with this one,
> and as you mentioned, "that would be something for another patch series :)" ?

Thanks for the really throughout testing! If you can debug where the
problem is still lurking fast, then cool, we can still fix it in this patch
series. If not, then I'm fine with just pushing what we have because
conceptually that seems like a sane thing to do anyway and we can fix the
remaining problem afterwards.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-10-01  7:34     ` Jan Kara
@ 2020-10-01 12:46       ` Mauricio Faria de Oliveira
  2020-10-02  8:39         ` Jan Kara
  0 siblings, 1 reply; 16+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-01 12:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Andreas Dilger, linux-ext4, dann frazier

On Thu, Oct 1, 2020 at 4:34 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 30-09-20 19:59:44, Mauricio Faria de Oliveira wrote:
> > On Tue, Sep 29, 2020 at 8:37 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Mon 28-09-20 16:40:59, Mauricio Faria de Oliveira wrote:
> > > > Hey Jan,
> > > >
> > > > This series implements your suggestions for the RFC PATCH v3 set [1].
> > > >
> > > > That addressed the issue you confirmed with block_page_mkwrite() [2].
> > > > There's no "JBD2: Spotted dirty metadata buffer" message in over 72h
> > > > runs across 3 VMs (it used to happen within a few hours.) *Thanks!*
> > > >
> > > > I added Reviewed-by: tags for the patches/changes you mentioned.
> > > > The only changes from v3 are patch 3 which is new, and contains
> > > > the 2 fixes to ext4_page_mkwrite(); and patch 4 changed 'struct
> > > > writeback_control.nr_to_write' from ~0ULL to LONG_MAX, since it
> > > > is signed long (~0ULL overflows to -1; kudos, kernel test robot!)
> > > >
> > > > It looks almost good on fstests: zero regressions on data=ordered,
> > > > and two apparently flaky tests data=journal (generic/347 _passed_
> > > > 1/6 times with the patch, and generic/547 failed 1/6 times.)
> > >
> > > Cool. Neither of these tests has anything to do with mmap. The first test
> > > checks what happens when thin provisioned storage runs out of space (and
> > > that fs remains consistent), the second tests that fsync flushed properly
> > > all data and that it can be seen after a crash. So I'm reasonably confident
> > > that it isn't caused by your patches. It still might be a bug in
> > > data=journal implementation though but that would be something for another
> > > patch series :).
> > >
> >
> > Hey Jan,
> >
> > That's good to hear! Now that the patchset seems to be in good shape,
> > I worked on testing it these last 2 days. Good and mixed-feelings news. :-)
> >
> > 1) For ext4 first, I have put 2 VMs to run fstests in a loop overnight,
> > (original and patched kernels, ~20 runs each). It looks like the patched VM
> > has more variance of failed/flaky tests, but the "average failure set" holds.
> >
> > I think some of the failures were flaky or timing related, because when I ran
> > some tests, e.g. generic/371 a hundred times (./check -i 100 generic/371)
> > then it only failed 6 out of 100 times. So I didn't look much more into it, but
> > should you feel like recommending a more careful look, I'll be happy to do it.
> >
> > For documentation purposes, the results on v5.9-rc7 and next-20200930,
> > showing no "permanent" regressions. Good news :)
> >
> >     data=ordered:
> >     Failures: ext4/045 generic/044 generic/045 generic/046 generic/051
> > generic/223 generic/388 generic/465 generic/475 generic/553
> > generic/554 generic/555 generic/565 generic/611
> >
> >     data=journal:
> >     Failures: ext4/045 generic/051 generic/223 generic/347 generic/388
> > generic/441 generic/475 generic/553 generic/554 generic/555
> > generic/565 generic/611
> >
> > 2) For OCFS2, I just had to change where we set the callbacks in (patch 2.)
> > (I'll include that in the next, hopefully non-RFC patchset, with
> > Andreas suggestions.)
> >
> > Then a local mount also has no regressions on "stress-ng --class filesystem,io".
> > Good news too :)  For reference, the steps:
> >
> >     # mkfs.ocfs2 --mount local $DEV
> >     # mount $DEV $MNT
> >     # cd $MNT
> >     # stress-ng --sequential 0 --class filesystem,io
>
> OK, these look sane. Nothing that would particularly worry me wrt this
> patch set although some of those errors (e.g. the flaky generic/371 test
> or generic/441 failure) would warrant closer look.
>

Sure, I'll check the flaky ones and follow up w/ a more detailed report.

Just to clarify on generic/441, it's not a regression or flaky; it fails
consistently on original/patched kernels (above lists apply to both.)

But absolutely generic/371 failing randomly on pwrite() _is_ interesting.

> > 3) Now, the mixed-feelings news.
> >
> > The synthetic test-case/patches I had written clearly show that the
> > patchset works:
> > - In the original kernel, userspace can write to buffers during
> > commit; and it moves on.
> > - In the patched kernel, userspace cannot write to buffers during
> > commit; it blocks.
> >
> > However, the heavy-hammer testing with 'stress-ng --mmap 4xNCPUs --mmap-file'
> > then crashing the kernel via sysrq-trigger, and trying to mount the
> > filesystem again,
> > sometimes still can find invalid checksums, thus journal recovery/mount fails.
> >
> >     [   98.194809] JBD2: Invalid checksum recovering data block 109704 in log
> >     [   98.201853] JBD2: Invalid checksum recovering data block 69959 in log
> >     [   98.339859] JBD2: recovery failed
> >     [   98.340581] EXT4-fs (vdc): error loading journal
> >
> > So, despite the test exercising mmap() and the patchset being for mmap(),
> > apparently there is more happening that also needs changes. (Weird; but
> > I will try to debug that test-case behavior deeper, to find what's going on.)
> >
> > This patchset does address a problem, so should we move on with this one,
> > and as you mentioned, "that would be something for another patch series :)" ?
>
> Thanks for the really throughout testing! If you can debug where the
> problem is still lurking fast, then cool, we can still fix it in this patch
> series. If not, then I'm fine with just pushing what we have because
> conceptually that seems like a sane thing to do anyway and we can fix the
> remaining problem afterwards.

Understood. I'll be able to look at this next week, which should be rc8 [1].
Would it be good enough, timing wise, to send a non-RFC series with
what we have (this other issue fixed or not) by the end of next week?

Thanks!
Mauricio

[1] https://lore.kernel.org/linux-ext4/CAHk-=whAe_n6JDyu40A15vnWs5PTU0QYX6t6-TbNeefanau6MA@mail.gmail.com/


>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

--
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-10-01 12:46       ` Mauricio Faria de Oliveira
@ 2020-10-02  8:39         ` Jan Kara
  0 siblings, 0 replies; 16+ messages in thread
From: Jan Kara @ 2020-10-02  8:39 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, Andreas Dilger, linux-ext4, dann frazier, Ted Tso

On Thu 01-10-20 09:46:32, Mauricio Faria de Oliveira wrote:
> On Thu, Oct 1, 2020 at 4:34 AM Jan Kara <jack@suse.cz> wrote:
> > On Wed 30-09-20 19:59:44, Mauricio Faria de Oliveira wrote:
> > > 3) Now, the mixed-feelings news.
> > >
> > > The synthetic test-case/patches I had written clearly show that the
> > > patchset works:
> > > - In the original kernel, userspace can write to buffers during
> > > commit; and it moves on.
> > > - In the patched kernel, userspace cannot write to buffers during
> > > commit; it blocks.
> > >
> > > However, the heavy-hammer testing with 'stress-ng --mmap 4xNCPUs --mmap-file'
> > > then crashing the kernel via sysrq-trigger, and trying to mount the
> > > filesystem again,
> > > sometimes still can find invalid checksums, thus journal recovery/mount fails.
> > >
> > >     [   98.194809] JBD2: Invalid checksum recovering data block 109704 in log
> > >     [   98.201853] JBD2: Invalid checksum recovering data block 69959 in log
> > >     [   98.339859] JBD2: recovery failed
> > >     [   98.340581] EXT4-fs (vdc): error loading journal
> > >
> > > So, despite the test exercising mmap() and the patchset being for mmap(),
> > > apparently there is more happening that also needs changes. (Weird; but
> > > I will try to debug that test-case behavior deeper, to find what's going on.)
> > >
> > > This patchset does address a problem, so should we move on with this one,
> > > and as you mentioned, "that would be something for another patch series :)" ?
> >
> > Thanks for the really throughout testing! If you can debug where the
> > problem is still lurking fast, then cool, we can still fix it in this patch
> > series. If not, then I'm fine with just pushing what we have because
> > conceptually that seems like a sane thing to do anyway and we can fix the
> > remaining problem afterwards.
> 
> Understood. I'll be able to look at this next week, which should be rc8 [1].
> Would it be good enough, timing wise, to send a non-RFC series with
> what we have (this other issue fixed or not) by the end of next week?

This is more a question for Ted as a maintainer (CCed) but end of next week
is probably too late because Ted needs time to merge the patches in his
tree, run his battery of tests, push changes to linux-next and let them
simmer there for a while before sending them to Linus. So I'd say submit
what you have on Monday / Tuesday and we can always add fixes on top as we
find them.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2020-10-02  8:39 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-28 19:40 [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-09-28 19:41 ` [RFC PATCH v4 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-29  2:24   ` Andreas Dilger
2020-09-30 21:36     ` Mauricio Faria de Oliveira
2020-09-28 19:41 ` [RFC PATCH v4 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-29  2:28   ` Andreas Dilger
2020-09-28 19:41 ` [RFC PATCH v4 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
2020-09-29  2:34   ` Andreas Dilger
2020-09-29 11:48   ` Jan Kara
2020-09-28 19:41 ` [RFC PATCH v4 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-29 12:10   ` Jan Kara
2020-09-29 11:37 ` [RFC PATCH v4 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
2020-09-30 22:59   ` Mauricio Faria de Oliveira
2020-10-01  7:34     ` Jan Kara
2020-10-01 12:46       ` Mauricio Faria de Oliveira
2020-10-02  8:39         ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.