OCFS2-Devel Archive on lore.kernel.org
 help / color / Atom feed
* [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit
@ 2020-10-06  0:48 Mauricio Faria de Oliveira
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-06  0:48 UTC (permalink / raw)
  To: linux-ext4, ocfs2-devel; +Cc: Jan Kara, Andreas Dilger, dann frazier, Joseph Qi

Hey Jan, Andreas, Ted et al.

This series fixes the issue that buffers writeably mapped to userspace
can be modified during transaction commit, in between checksumming and
write-out to the journal, thus cause inconsistency in the journal that
prevents recovery/mount after kernel crash or power loss.

It's really ideas and patience from Jan guiding me how to write/fix it.
Huge thanks!

Although the synthetic test case in [v2] demonstrates the bug and that
the fix works, there's stress-ng tests that still causes inconsistency.

Unfortunately I couldn't look more into it yet; per discussion in [v4]
cover letter I'll send it as-is now, since it already fixes something,
and will continue to analyze. It might be something else/another fix.

There's also some fstests that _apparently_ become more flaky w/ the
patchset applied, but don't seem to be mmap() related.. so I'll look
at them more carefully too. The set of consistent failures (ie, that
happen most of the time) is the same between original/patched builds,
on both data=ordered and data=journal (different for each mode, ofc.)

    data=ordered:
    Failures: ext4/045 generic/044 generic/045 generic/046 generic/051
generic/223 generic/388 generic/465 generic/475 generic/553
generic/554 generic/555 generic/565 generic/611

    data=journal:
    Failures: ext4/045 generic/051 generic/223 generic/347 generic/388
generic/441 generic/475 generic/553 generic/554 generic/555
generic/565 generic/611

There's a small change to OCFS2 in patch 2, which has been tested w/
stress-ng's filesystem and io stressor classes; no regressions found.

    # mkfs.ocfs2 --mount local $DEV
    # mount $DEV $MNT
    # cd $MNT
    # stress-ng --sequential 0 --class filesystem,io

The only changes from v4 are style-change suggestions from Andreas
applied to patches 02/04 and where we set OCFS2 journal callbacks;
plus Reviewed-By: tags.

Tested on v5.9-rc7'ish and next-20200930; build tested on -rc8'ish
and next-20201002 today.

cheers,
Mauricio

[v4] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200928194103.244692-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeoH9QUyCg$ 
[v3] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200910193127.276214-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeodPYbH2w$ 
[v2] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200810010210.3305322-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaeocoAghAA$ 
[v1] https://urldefense.com/v3/__https://lore.kernel.org/linux-ext4/20200423233705.5878-1-mfo at canonical.com/__;!!GqivPVa7Brio!KLi18zziDAgi-C8ol7nWzoPGTJ3JSrOijN1MmSjdZ3Zw9wnZNnCNYgoJq7LjaepKEikFzg$ 

Mauricio Faria de Oliveira (4):
  jbd2: introduce/export functions
    jbd2_journal_submit|finish_inode_data_buffers()
  jbd2, ext4, ocfs2: introduce/use journal callbacks
    j_submit|finish_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  ext4: data=journal: write-protect pages on
    j_submit_inode_data_buffers()

 fs/ext4/inode.c      | 62 ++++++++++++++++++++++++++-----
 fs/ext4/super.c      | 87 ++++++++++++++++++++++++++++++++++++++++++++
 fs/jbd2/commit.c     | 62 ++++++++++++++++---------------
 fs/jbd2/journal.c    |  2 +
 fs/ocfs2/journal.c   |  4 ++
 include/linux/jbd2.h | 29 ++++++++++++++-
 6 files changed, 206 insertions(+), 40 deletions(-)

-- 
2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-10-06  0:48 [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
@ 2020-10-06  0:48 ` Mauricio Faria de Oliveira
  2020-10-09  2:05   ` Theodore Y. Ts'o
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-06  0:48 UTC (permalink / raw)
  To: linux-ext4, ocfs2-devel; +Cc: Jan Kara, Andreas Dilger, dann frazier, Joseph Qi

Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().

No functional change.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/jbd2/commit.c     | 36 ++++++++++++++++--------------------
 fs/jbd2/journal.c    |  2 ++
 include/linux/jbd2.h |  4 ++++
 3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..f79b86b4241f 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,19 +187,17 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with delayed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping,
-		loff_t dirty_start, loff_t dirty_end)
+int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
-	int ret;
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
 		.nr_to_write = mapping->nrpages * 2,
-		.range_start = dirty_start,
-		.range_end = dirty_end,
+		.range_start = jinode->i_dirty_start,
+		.range_end = jinode->i_dirty_end,
 	};
 
-	ret = generic_writepages(mapping, &wbc);
-	return ret;
+	return generic_writepages(mapping, &wbc);
 }
 
 /*
@@ -215,16 +213,11 @@ static int journal_submit_data_buffers(journal_t *journal,
 {
 	struct jbd2_inode *jinode;
 	int err, ret = 0;
-	struct address_space *mapping;
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
-		mapping = jinode->i_vfs_inode->i_mapping;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
 		/*
@@ -234,8 +227,7 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_submit_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -248,6 +240,15 @@ static int journal_submit_data_buffers(journal_t *journal,
 	return ret;
 }
 
+int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+
+	return filemap_fdatawait_range_keep_errors(mapping,
+						   jinode->i_dirty_start,
+						   jinode->i_dirty_end);
+}
+
 /*
  * Wait for data submitted for writeout, refile inodes to proper
  * transaction if needed.
@@ -262,16 +263,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_range_keep_errors(
-				jinode->i_vfs_inode->i_mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_finish_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..c0600405e7a2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
+EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
+EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 08f904943ab2..2865a5475888 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1421,6 +1421,10 @@ extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
 extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
 			struct jbd2_inode *inode, loff_t start_byte,
 			loff_t length);
+extern int	   jbd2_journal_submit_inode_data_buffers(
+			struct jbd2_inode *jinode);
+extern int	   jbd2_journal_finish_inode_data_buffers(
+			struct jbd2_inode *jinode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-10-06  0:48 [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-10-06  0:48 ` Mauricio Faria de Oliveira
  2020-10-09  2:08   ` Theodore Y. Ts'o
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-06  0:48 UTC (permalink / raw)
  To: linux-ext4, ocfs2-devel; +Cc: Jan Kara, Andreas Dilger, dann frazier, Joseph Qi

Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().

The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.

Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.

Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
which builds fs/jbd2/commit.c and journal.c that define and
export the functions, so we can call directly in ext4/ocfs2.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/ext4/super.c      |  4 ++++
 fs/jbd2/commit.c     | 30 ++++++++++++++++++------------
 fs/ocfs2/journal.c   |  4 ++++
 include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
 4 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ea425b49b345..a14c1ed39aa3 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4646,6 +4646,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	sbi->s_journal->j_submit_inode_data_buffers =
+		jbd2_journal_submit_inode_data_buffers;
+	sbi->s_journal->j_finish_inode_data_buffers =
+		jbd2_journal_finish_inode_data_buffers;
 
 no_journal:
 	if (!test_opt(sb, NO_MBCACHE)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index f79b86b4241f..6252b4c50666 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -197,6 +197,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 		.range_end = jinode->i_dirty_end,
 	};
 
+	/*
+	 * submit the inode data buffers. We use writepage
+	 * instead of writepages. Because writepages can do
+	 * block allocation with delalloc. We need to write
+	 * only allocated blocks here.
+	 */
 	return generic_writepages(mapping, &wbc);
 }
 
@@ -220,16 +226,13 @@ static int journal_submit_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		/*
-		 * submit the inode data buffers. We use writepage
-		 * instead of writepages. Because writepages can do
-		 * block allocation  with delalloc. We need to write
-		 * only allocated blocks here.
-		 */
+		/* submit the inode data buffers. */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = jbd2_journal_submit_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		if (journal->j_submit_inode_data_buffers) {
+			err = journal->j_submit_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		J_ASSERT(jinode->i_transaction == commit_transaction);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
@@ -267,9 +270,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = jbd2_journal_finish_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		/* wait for the inode data buffers writeout. */
+		if (journal->j_finish_inode_data_buffers) {
+			err = journal->j_finish_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index b425f0b01dce..b9a9d69dde7e 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -883,6 +883,10 @@ int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty)
 		  OCFS2_JOURNAL_DIRTY_FL);
 
 	journal->j_journal = j_journal;
+	journal->j_journal->j_submit_inode_data_buffers =
+		jbd2_journal_submit_inode_data_buffers;
+	journal->j_journal->j_finish_inode_data_buffers =
+		jbd2_journal_finish_inode_data_buffers;
 	journal->j_inode = inode;
 	journal->j_bh = bh;
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2865a5475888..4aaa408c0ca7 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,9 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * List of inodes whose data we've modified in data=ordered mode.
+	 * List of inodes associated with the transaction; e.g., ext4 uses
+	 * this to track inodes in data=ordered and data=journal mode that
+	 * need special handling on transaction commit; also used by ocfs2.
 	 * [j_list_lock]
 	 */
 	struct list_head	t_inode_list;
@@ -1111,6 +1113,27 @@ struct journal_s
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);
 
+	/**
+	 * @j_submit_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WRITE_DATA flag
+	 * before we start to write out the transaction to the journal.
+	 */
+	int			(*j_submit_inode_data_buffers)
+					(struct jbd2_inode *);
+
+	/**
+	 * @j_finish_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WAIT_DATA flag
+	 * after we have written the transaction to the journal
+	 * but before we write out the commit block.
+	 */
+	int			(*j_finish_inode_data_buffers)
+					(struct jbd2_inode *);
+
 	/*
 	 * Journal statistics
 	 */
-- 
2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
  2020-10-06  0:48 [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-10-06  0:48 ` Mauricio Faria de Oliveira
  2020-10-09  2:10   ` Theodore Y. Ts'o
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-06  0:48 UTC (permalink / raw)
  To: linux-ext4, ocfs2-devel; +Cc: Jan Kara, Andreas Dilger, dann frazier, Joseph Qi

These are two fixes for data journalling required by
the next patch, discovered while testing it.

First, the optimization to return early if all buffers
are mapped is not appropriate for the next patch:

The inode _must_ be added to the transaction's list in
data=journal mode (so to write-protect pages on commit)
thus we cannot return early there.

Second, once that optimization to reduce transactions
was disabled for data=journal mode, more transactions
happened, and occasionally hit this warning message:
'JBD2: Spotted dirty metadata buffer'.

Reason is, block_page_mkwrite() will set_buffer_dirty()
before do_journal_get_write_access() that is there to
prevent it. This issue was masked by the optimization.

So, on data=journal use __block_write_begin() instead.
This also requires page locking and len recalculation.
(see block_page_mkwrite() for implementation details.)

Finally, as Jan noted there is little sharing between
data=journal and other modes in ext4_page_mkwrite().

However, a prototype of ext4_journalled_page_mkwrite()
showed there still would be lots of duplicated lines
(tens of) that didn't seem worth it.

Thus this patch ends up with an ugly goto to skip all
non-data journalling code (to avoid long indentations,
but that can be changed..) in the beginning, and just
a conditional in the transaction section.

Well, we skip a common part to data journalling which
is the page truncated check, but we do it again after
ext4_journal_start() when we re-acquire the page lock
(so not to acquire the page lock twice needlessly for
data journalling.)

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
---
 fs/ext4/inode.c | 51 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..ac153e340a6f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5977,9 +5977,17 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	if (err)
 		goto out_ret;
 
+	/*
+	 * On data journalling we skip straight to the transaction handle:
+	 * there's no delalloc; page truncated will be checked later; the
+	 * early return w/ all buffers mapped (calculates size/len) can't
+	 * be used; and there's no dioread_nolock, so only ext4_get_block.
+	 */
+	if (ext4_should_journal_data(inode))
+		goto retry_alloc;
+
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
-	    !ext4_should_journal_data(inode) &&
 	    !ext4_nonda_switch(inode->i_sb)) {
 		do {
 			err = block_page_mkwrite(vma, vmf,
@@ -6005,6 +6013,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	/*
 	 * Return if we have all the buffers mapped. This avoids the need to do
 	 * journal_start/journal_stop which can block and take a long time
+	 *
+	 * This cannot be done for data journalling, as we have to add the
+	 * inode to the transaction's list to writeprotect pages on commit.
 	 */
 	if (page_has_buffers(page)) {
 		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
@@ -6029,16 +6040,42 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		ret = VM_FAULT_SIGBUS;
 		goto out;
 	}
-	err = block_page_mkwrite(vma, vmf, get_block);
-	if (!err && ext4_should_journal_data(inode)) {
-		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
-			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
+	/*
+	 * Data journalling can't use block_page_mkwrite() because it
+	 * will set_buffer_dirty() before do_journal_get_write_access()
+	 * thus might hit warning messages for dirty metadata buffers.
+	 */
+	if (!ext4_should_journal_data(inode)) {
+		err = block_page_mkwrite(vma, vmf, get_block);
+	} else {
+		lock_page(page);
+		size = i_size_read(inode);
+		/* Page got truncated from under us? */
+		if (page->mapping != mapping || page_offset(page) > size) {
 			unlock_page(page);
-			ret = VM_FAULT_SIGBUS;
+			ret = VM_FAULT_NOPAGE;
 			ext4_journal_stop(handle);
 			goto out;
 		}
-		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+
+		if (page->index == size >> PAGE_SHIFT)
+			len = size & ~PAGE_MASK;
+		else
+			len = PAGE_SIZE;
+
+		err = __block_write_begin(page, 0, len, ext4_get_block);
+		if (!err) {
+			if (ext4_walk_page_buffers(handle, page_buffers(page),
+					0, len, NULL, do_journal_get_write_access)) {
+				unlock_page(page);
+				ret = VM_FAULT_SIGBUS;
+				ext4_journal_stop(handle);
+				goto out;
+			}
+			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+		} else {
+			unlock_page(page);
+		}
 	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-- 
2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-10-06  0:48 [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
@ 2020-10-06  0:48 ` Mauricio Faria de Oliveira
  2020-10-09  2:10   ` Theodore Y. Ts'o
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-10-06  0:48 UTC (permalink / raw)
  To: linux-ext4, ocfs2-devel; +Cc: Jan Kara, Andreas Dilger, dann frazier, Joseph Qi

This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.

If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.

    [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
    [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
    [   27.342716] JBD2: recovery failed
    [   27.343316] EXT4-fs (loop0): error loading journal
    mount: /ext4: can't read superblock on /dev/loop0.

In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.

In j_finish_inode_data_buffers() there is nothing do to.

And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().

In ext4_page_mkwrite() we must make sure that the buffers are attached
to the transaction as jbddirty with write_end_fn(), as already done in
__ext4_journalled_writepage().

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c | 25 +++++++++-----
 fs/ext4/super.c | 87 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 101 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ac153e340a6f..af5de62c1214 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
 		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
 					     write_end_fn);
 	}
+	if (ret == 0)
+		ret = err;
+	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
 	if (ret == 0)
 		ret = err;
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -6052,10 +6055,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		size = i_size_read(inode);
 		/* Page got truncated from under us? */
 		if (page->mapping != mapping || page_offset(page) > size) {
-			unlock_page(page);
 			ret = VM_FAULT_NOPAGE;
-			ext4_journal_stop(handle);
-			goto out;
+			goto out_error;
 		}
 
 		if (page->index == size >> PAGE_SHIFT)
@@ -6065,13 +6066,15 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 
 		err = __block_write_begin(page, 0, len, ext4_get_block);
 		if (!err) {
+			ret = VM_FAULT_SIGBUS;
 			if (ext4_walk_page_buffers(handle, page_buffers(page),
-					0, len, NULL, do_journal_get_write_access)) {
-				unlock_page(page);
-				ret = VM_FAULT_SIGBUS;
-				ext4_journal_stop(handle);
-				goto out;
-			}
+					0, len, NULL, do_journal_get_write_access))
+				goto out_error;
+			if (ext4_walk_page_buffers(handle, page_buffers(page),
+					0, len, NULL, write_end_fn))
+				goto out_error;
+			if (ext4_jbd2_inode_add_write(handle, inode, 0, len))
+				goto out_error;
 			ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 		} else {
 			unlock_page(page);
@@ -6086,6 +6089,10 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
+out_error:
+	unlock_page(page);
+	ext4_journal_stop(handle);
+	goto out;
 }
 
 vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a14c1ed39aa3..a2fc62a6d3b7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,89 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	spin_unlock(&sbi->s_md_lock);
 }
 
+/*
+ * This writepage callback for write_cache_pages()
+ * takes care of a few cases after page cleaning.
+ *
+ * write_cache_pages() already checks for dirty pages
+ * and calls clear_page_dirty_for_io(), which we want,
+ * to write protect the pages.
+ *
+ * However, we may have to redirty a page (see below.)
+ */
+static int ext4_journalled_writepage_callback(struct page *page,
+					      struct writeback_control *wbc,
+					      void *data)
+{
+	transaction_t *transaction = (transaction_t *) data;
+	struct buffer_head *bh, *head;
+	struct journal_head *jh;
+
+	bh = head = page_buffers(page);
+	do {
+		/*
+		 * We have to redirty a page in these cases:
+		 * 1) If buffer is dirty, it means the page was dirty because it
+		 * contains a buffer that needs checkpointing. So the dirty bit
+		 * needs to be preserved so that checkpointing writes the buffer
+		 * properly.
+		 * 2) If buffer is not part of the committing transaction
+		 * (we may have just accidentally come across this buffer because
+		 * inode range tracking is not exact) or if the currently running
+		 * transaction already contains this buffer as well, dirty bit
+		 * needs to be preserved so that the buffer gets writeprotected
+		 * properly on running transaction's commit.
+		 */
+		jh = bh2jh(bh);
+		if (buffer_dirty(bh) ||
+		    (jh && (jh->b_transaction != transaction ||
+			    jh->b_next_transaction))) {
+			redirty_page_for_writepage(wbc, page);
+			goto out;
+		}
+	} while ((bh = bh->b_this_page) != head);
+
+out:
+	return AOP_WRITEPAGE_ACTIVATE;
+}
+
+static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	struct writeback_control wbc = {
+		.sync_mode =  WB_SYNC_ALL,
+		.nr_to_write = LONG_MAX,
+		.range_start = jinode->i_dirty_start,
+		.range_end = jinode->i_dirty_end,
+        };
+
+	return write_cache_pages(mapping, &wbc,
+				 ext4_journalled_writepage_callback,
+				 jinode->i_transaction);
+}
+
+static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	int ret;
+
+	if (ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = ext4_journalled_submit_inode_data_buffers(jinode);
+	else
+		ret = jbd2_journal_submit_inode_data_buffers(jinode);
+
+	return ret;
+}
+
+static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	int ret = 0;
+
+	if (!ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = jbd2_journal_finish_inode_data_buffers(jinode);
+
+	return ret;
+}
+
 static bool system_going_down(void)
 {
 	return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4647,9 +4730,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
 	sbi->s_journal->j_submit_inode_data_buffers =
-		jbd2_journal_submit_inode_data_buffers;
+		ext4_journal_submit_inode_data_buffers;
 	sbi->s_journal->j_finish_inode_data_buffers =
-		jbd2_journal_finish_inode_data_buffers;
+		ext4_journal_finish_inode_data_buffers;
 
 no_journal:
 	if (!test_opt(sb, NO_MBCACHE)) {
-- 
2.17.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-10-09  2:05   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09  2:05 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: linux-ext4, ocfs2-devel, Jan Kara, Andreas Dilger, dann frazier,
	Joseph Qi

On Mon, Oct 05, 2020 at 09:48:38PM -0300, Mauricio Faria de Oliveira wrote:
> Export functions that implement the current behavior done
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> No functional change.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-10-09  2:08   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09  2:08 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: linux-ext4, ocfs2-devel, Jan Kara, Andreas Dilger, dann frazier,
	Joseph Qi

On Mon, Oct 05, 2020 at 09:48:39PM -0300, Mauricio Faria de Oliveira wrote:
> Introduce journal callbacks to allow different behaviors
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> The existing users of the current behavior (ext4, ocfs2)
> are adapted to use the previously exported functions
> that implement the current behavior.
> 
> Users are callers of jbd2_journal_inode_ranged_write|wait(),
> which adds the inode to the transaction's inode list with
> the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
> 
> Both CONFIG_EXT4_FS and CONFIG_OCSFS2_FS select CONFIG_JBD2,
> which builds fs/jbd2/commit.c and journal.c that define and
> export the functions, so we can call directly in ext4/ocfs2.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite()
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
@ 2020-10-09  2:10   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09  2:10 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: linux-ext4, ocfs2-devel, Jan Kara, Andreas Dilger, dann frazier,
	Joseph Qi

On Mon, Oct 05, 2020 at 09:48:40PM -0300, Mauricio Faria de Oliveira wrote:
> These are two fixes for data journalling required by
> the next patch, discovered while testing it.
> 
> First, the optimization to return early if all buffers
> are mapped is not appropriate for the next patch:
> 
> The inode _must_ be added to the transaction's list in
> data=journal mode (so to write-protect pages on commit)
> thus we cannot return early there.
> 
> Second, once that optimization to reduce transactions
> was disabled for data=journal mode, more transactions
> happened, and occasionally hit this warning message:
> 'JBD2: Spotted dirty metadata buffer'.
> 
> Reason is, block_page_mkwrite() will set_buffer_dirty()
> before do_journal_get_write_access() that is there to
> prevent it. This issue was masked by the optimization.
> 
> So, on data=journal use __block_write_begin() instead.
> This also requires page locking and len recalculation.
> (see block_page_mkwrite() for implementation details.)
> 
> Finally, as Jan noted there is little sharing between
> data=journal and other modes in ext4_page_mkwrite().
> 
> However, a prototype of ext4_journalled_page_mkwrite()
> showed there still would be lots of duplicated lines
> (tens of) that didn't seem worth it.
> 
> Thus this patch ends up with an ugly goto to skip all
> non-data journalling code (to avoid long indentations,
> but that can be changed..) in the beginning, and just
> a conditional in the transaction section.
> 
> Well, we skip a common part to data journalling which
> is the page truncated check, but we do it again after
> ext4_journal_start() when we re-acquire the page lock
> (so not to acquire the page lock twice needlessly for
> data journalling.)
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Andreas Dilger <adilger@dilger.ca>

Thanks, applied.

				- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-10-09  2:10   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 9+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09  2:10 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: linux-ext4, ocfs2-devel, Jan Kara, Andreas Dilger, dann frazier,
	Joseph Qi

On Mon, Oct 05, 2020 at 09:48:41PM -0300, Mauricio Faria de Oliveira wrote:
> This implements journal callbacks j_submit|finish_inode_data_buffers()
> with different behavior for data=journal: to write-protect pages under
> commit, preventing changes to buffers writeably mapped to userspace.
> 
> If a buffer's content changes between commit's checksum calculation
> and write-out to disk, it can cause journal recovery/mount failures
> upon a kernel crash or power loss.
> 
>     [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
>     [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
>     [   27.342716] JBD2: recovery failed
>     [   27.343316] EXT4-fs (loop0): error loading journal
>     mount: /ext4: can't read superblock on /dev/loop0.
> 
> In j_submit_inode_data_buffers() we write-protect the inode's pages
> with write_cache_pages() and redirty w/ writepage callback if needed.
> 
> In j_finish_inode_data_buffers() there is nothing do to.
> 
> And in order to use the callbacks, inodes are added to the inode list
> in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
> 
> In ext4_page_mkwrite() we must make sure that the buffers are attached
> to the transaction as jbddirty with write_end_fn(), as already done in
> __ext4_journalled_writepage().
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Reported-by: Dann Frazier <dann.frazier@canonical.com>
> Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write
> Suggested-by: Jan Kara <jack@suse.cz>
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks, applied.

					- Ted

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, back to index

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-06  0:48 [Ocfs2-devel] [PATCH v5 0/4] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 1/4] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-10-09  2:05   ` Theodore Y. Ts'o
2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 2/4] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-10-09  2:08   ` Theodore Y. Ts'o
2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 3/4] ext4: data=journal: fixes for ext4_page_mkwrite() Mauricio Faria de Oliveira
2020-10-09  2:10   ` Theodore Y. Ts'o
2020-10-06  0:48 ` [Ocfs2-devel] [PATCH v5 4/4] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
2020-10-09  2:10   ` Theodore Y. Ts'o

OCFS2-Devel Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/ocfs2-devel/0 ocfs2-devel/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ocfs2-devel ocfs2-devel/ https://lore.kernel.org/ocfs2-devel \
		ocfs2-devel@oss.oracle.com
	public-inbox-index ocfs2-devel

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/com.oracle.oss.ocfs2-devel


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git