linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit
@ 2020-09-10 19:31 Mauricio Faria de Oliveira
  2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira

Hey Jan,

This series implements your suggestions for the RFC PATCH v2 set,
which we mostly talked about in cover letter [1] and PATCH 3 [2].
(I added Suggested-by: tags, by the way, for due credit.)

It looks almost good on fstests: zero regressions on data=ordered,
and only one regression in data=journal (generic/347); I'll check.
(That's with ext4; I'll check ocfs2, but it's only a minor change.)

However, there's an issue I have to check with you about, that we
exposed from the original kernel. It's described below, but other
than this I _guess_ this should be close if you don't spot errors.

Thanks!
Mauricio

...

Testing with 'stress-ng --mmap <N> --mmap-file' runs well for days,
but it occasionally hits:

  JBD2: Spotted dirty metadata buffer (dev = vdc, blocknr = 74775).
  There's a risk of filesystem corruption in case of system crash.

I added dump_stack() in warn_dirty_buffer(), and it usually comes
from jbd2_journal_file_buffer(, BJ_Forget) in the commit function.
When filing from BJ_Shadow to BJ_Forget.. so it possibly happened
while BH_Shadow was still set!

So I instrumented [test_]set_buffer_dirty() macros to dump_stack()
if BH_Shadow is set (i.e. buffer being set dirty during write-out.)

This showed that the occasional BH_Dirty setter with BH_Shadow set
is block_page_mkwrite() in ext4_page_mkwrite(). And it seems right,
because it's called before do_journal_get_write_access() (where we
check for/wait on BH_Shadow.)

ext4_page_mkwrite():

        err = block_page_mkwrite(vma, vmf, get_block);
        if (!err && ext4_should_journal_data(inode)) {
                if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
                          PAGE_SIZE, NULL, do_journal_get_write_access)) {

The patches didn't directly break this, they only allow this code
to run more often as we disabled an early-return optimization for
the case 'all buffers mapped' in data=journal (question 2 in [1]):

ext4_page_mkwrite():

         * Return if we have all the buffers mapped.
	...
-       if (page_has_buffers(page)) {
+       if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {


In order to confirm it, I built the unpatched v5.9-rc4 kernel, with
just the change to disable that optimization in data=journal -- and
it hit that occasionally too ("JBD2: Spotted dirty metadata buffer".)

I was naive enough to mindlessly try to swap the order of those two
statements, in hopes that do_journal_get_write_access() should wait
for BH_Shadow to clear, and then we just call block_page_mkwrite().
BUT it trips into the BUG() check in page_buffers() in the former.

I still have to dig more about it, but if you have something quick
in mind, I'd appreciate any comments/feedback about it, if it gets
more complex than I can see.

Thanks again!

[1] https://lore.kernel.org/linux-ext4/20200819092735.GI1902@quack2.suse.cz/
[2] https://lore.kernel.org/linux-ext4/20200821102625.GB3432@quack2.suse.cz/

Mauricio Faria de Oliveira (3):
  jbd2: introduce/export functions
    jbd2_journal_submit|finish_inode_data_buffers()
  jbd2, ext4, ocfs2: introduce/use journal callbacks
    j_submit|finish_inode_data_buffers()
  ext4: data=journal: write-protect pages on
    j_submit_inode_data_buffers()

 fs/ext4/inode.c      | 29 +++++++++++-----
 fs/ext4/super.c      | 82 ++++++++++++++++++++++++++++++++++++++++++++
 fs/jbd2/commit.c     | 58 +++++++++++++++++--------------
 fs/jbd2/journal.c    |  2 ++
 fs/ocfs2/super.c     | 15 ++++++++
 include/linux/jbd2.h | 29 +++++++++++++++-
 6 files changed, 181 insertions(+), 34 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
  2020-09-16 16:19   ` Jan Kara
  2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira

Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().

No functional change.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/commit.c     | 32 +++++++++++++++++---------------
 fs/jbd2/journal.c    |  2 ++
 include/linux/jbd2.h |  4 ++++
 3 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..c17cda96926e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,9 +187,11 @@ static int journal_wait_on_commit_record(journal_t *journal,
  * use writepages() because with delayed allocation we may be doing
  * block allocation in writepages().
  */
-static int journal_submit_inode_data_buffers(struct address_space *mapping,
-		loff_t dirty_start, loff_t dirty_end)
+int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode =  WB_SYNC_ALL,
@@ -215,16 +217,11 @@ static int journal_submit_data_buffers(journal_t *journal,
 {
 	struct jbd2_inode *jinode;
 	int err, ret = 0;
-	struct address_space *mapping;
 
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
-		mapping = jinode->i_vfs_inode->i_mapping;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
 		/*
@@ -234,8 +231,7 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * only allocated blocks here.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_submit_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -248,6 +244,17 @@ static int journal_submit_data_buffers(journal_t *journal,
 	return ret;
 }
 
+int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
+	int ret;
+
+	ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
+	return ret;
+}
+
 /*
  * Wait for data submitted for writeout, refile inodes to proper
  * transaction if needed.
@@ -262,16 +269,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 	/* For locking, see the comment in journal_submit_data_buffers() */
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
-		loff_t dirty_start = jinode->i_dirty_start;
-		loff_t dirty_end = jinode->i_dirty_end;
-
 		if (!(jinode->i_flags & JI_WAIT_DATA))
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_range_keep_errors(
-				jinode->i_vfs_inode->i_mapping, dirty_start,
-				dirty_end);
+		err = jbd2_journal_finish_inode_data_buffers(jinode);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..c0600405e7a2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
 EXPORT_SYMBOL(jbd2_journal_force_commit);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
 EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
+EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
+EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
 EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
 EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 08f904943ab2..2865a5475888 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1421,6 +1421,10 @@ extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
 extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
 			struct jbd2_inode *inode, loff_t start_byte,
 			loff_t length);
+extern int	   jbd2_journal_submit_inode_data_buffers(
+			struct jbd2_inode *jinode);
+extern int	   jbd2_journal_finish_inode_data_buffers(
+			struct jbd2_inode *jinode);
 extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
 				struct jbd2_inode *inode, loff_t new_size);
 extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
  2020-09-16 16:22   ` Jan Kara
  2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
  2020-09-16 16:15 ` [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira

Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().

The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.

Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/super.c      | 14 ++++++++++++++
 fs/jbd2/commit.c     | 30 ++++++++++++++++++------------
 fs/ocfs2/super.c     | 15 +++++++++++++++
 include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
 4 files changed, 71 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ea425b49b345..7303839d7ad9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	spin_unlock(&sbi->s_md_lock);
 }
 
+static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
 static bool system_going_down(void)
 {
 	return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4646,6 +4656,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
 
 	sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+	sbi->s_journal->j_submit_inode_data_buffers =
+		ext4_journal_submit_inode_data_buffers;
+	sbi->s_journal->j_finish_inode_data_buffers =
+		ext4_journal_finish_inode_data_buffers;
 
 no_journal:
 	if (!test_opt(sb, NO_MBCACHE)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c17cda96926e..23d3fcc11b97 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -200,6 +200,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 		.range_end = dirty_end,
 	};
 
+	/*
+	 * submit the inode data buffers. We use writepage
+	 * instead of writepages. Because writepages can do
+	 * block allocation with delalloc. We need to write
+	 * only allocated blocks here.
+	 */
 	ret = generic_writepages(mapping, &wbc);
 	return ret;
 }
@@ -224,16 +230,13 @@ static int journal_submit_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		/*
-		 * submit the inode data buffers. We use writepage
-		 * instead of writepages. Because writepages can do
-		 * block allocation  with delalloc. We need to write
-		 * only allocated blocks here.
-		 */
+		/* submit the inode data buffers. */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = jbd2_journal_submit_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		if (journal->j_submit_inode_data_buffers) {
+			err = journal->j_submit_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		J_ASSERT(jinode->i_transaction == commit_transaction);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
@@ -273,9 +276,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = jbd2_journal_finish_inode_data_buffers(jinode);
-		if (!ret)
-			ret = err;
+		/* wait for the inode data buffers writeout. */
+		if (journal->j_finish_inode_data_buffers) {
+			err = journal->j_finish_inode_data_buffers(jinode);
+			if (!ret)
+				ret = err;
+		}
 		spin_lock(&journal->j_list_lock);
 		jinode->i_flags &= ~JI_COMMIT_RUNNING;
 		smp_mb();
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 1d91dd1e8711..f4e62aafc89c 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2010,6 +2010,16 @@ static int ocfs2_journal_addressable(struct ocfs2_super *osb)
 	return status;
 }
 
+static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ocfs2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
 static int ocfs2_initialize_super(struct super_block *sb,
 				  struct buffer_head *bh,
 				  int sector_size,
@@ -2211,6 +2221,11 @@ static int ocfs2_initialize_super(struct super_block *sb,
 	}
 	osb->journal = journal;
 	journal->j_osb = osb;
+	journal->j_journal->j_submit_inode_data_buffers =
+		ocfs2_journal_submit_inode_data_buffers;
+	journal->j_journal->j_finish_inode_data_buffers =
+		ocfs2_journal_finish_inode_data_buffers;
+
 
 	atomic_set(&journal->j_num_trans, 0);
 	init_rwsem(&journal->j_trans_barrier);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2865a5475888..4aaa408c0ca7 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,9 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * List of inodes whose data we've modified in data=ordered mode.
+	 * List of inodes associated with the transaction; e.g., ext4 uses
+	 * this to track inodes in data=ordered and data=journal mode that
+	 * need special handling on transaction commit; also used by ocfs2.
 	 * [j_list_lock]
 	 */
 	struct list_head	t_inode_list;
@@ -1111,6 +1113,27 @@ struct journal_s
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);
 
+	/**
+	 * @j_submit_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WRITE_DATA flag
+	 * before we start to write out the transaction to the journal.
+	 */
+	int			(*j_submit_inode_data_buffers)
+					(struct jbd2_inode *);
+
+	/**
+	 * @j_finish_inode_data_buffers:
+	 *
+	 * This function is called for all inodes associated with the
+	 * committing transaction marked with JI_WAIT_DATA flag
+	 * after we have written the transaction to the journal
+	 * but before we write out the commit block.
+	 */
+	int			(*j_finish_inode_data_buffers)
+					(struct jbd2_inode *);
+
 	/*
 	 * Journal statistics
 	 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
  2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
  2020-09-16 16:59   ` Jan Kara
  2020-09-16 16:15 ` [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira

This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.

If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.

    [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
    [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
    [   27.342716] JBD2: recovery failed
    [   27.343316] EXT4-fs (loop0): error loading journal
    mount: /ext4: can't read superblock on /dev/loop0.

In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.

In j_finish_inode_data_buffers() there is nothing do to.

And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().

In ext4_page_mkwrite() we must make sure that:

1) the inode is always added to the list;
   thus we skip the 'all buffers mapped' optimization on data=journal;

2) the buffers are attached to transaction as dirty;
   as already done in __ext4_journalled_writepage().

Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
---
 fs/ext4/inode.c | 29 ++++++++++++++------
 fs/ext4/super.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 91 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..fa4109da056c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
 		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
 					     write_end_fn);
 	}
+	if (ret == 0)
+		ret = err;
+	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
 	if (ret == 0)
 		ret = err;
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -6004,9 +6007,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		len = PAGE_SIZE;
 	/*
 	 * Return if we have all the buffers mapped. This avoids the need to do
-	 * journal_start/journal_stop which can block and take a long time
+	 * journal_start/journal_stop which can block and take a long time.
+	 *
+	 * This cannot be done for data journalling, as we have to add the
+	 * inode to the transaction's list to writeprotect pages on commit.
 	 */
-	if (page_has_buffers(page)) {
+	if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
 		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
 					    0, len, NULL,
 					    ext4_bh_unmapped)) {
@@ -6032,12 +6038,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	err = block_page_mkwrite(vma, vmf, get_block);
 	if (!err && ext4_should_journal_data(inode)) {
 		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
-			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
-			unlock_page(page);
-			ret = VM_FAULT_SIGBUS;
-			ext4_journal_stop(handle);
-			goto out;
-		}
+			PAGE_SIZE, NULL, do_journal_get_write_access))
+			goto out_err;
+		/* Make sure buffers are attached to the transaction as dirty */
+		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
+			PAGE_SIZE, NULL, write_end_fn))
+			goto out_err;
+		if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE))
+			goto out_err;
 		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 	}
 	ext4_journal_stop(handle);
@@ -6049,6 +6057,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	up_read(&EXT4_I(inode)->i_mmap_sem);
 	sb_end_pagefault(inode->i_sb);
 	return ret;
+out_err:
+	unlock_page(page);
+	ret = VM_FAULT_SIGBUS;
+	ext4_journal_stop(handle);
+	goto out;
 }
 
 vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7303839d7ad9..528b5e20b71c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,14 +472,82 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	spin_unlock(&sbi->s_md_lock);
 }
 
+/*
+ * This writepage callback for write_cache_pages()
+ * takes care of a few cases after page cleaning.
+ *
+ * write_cache_pages() already checks for dirty pages
+ * and calls clear_page_dirty_for_io(), which we want,
+ * to write protect the pages.
+ *
+ * However, we have to redirty a page in these cases:
+ * 1) some buffer is dirty (needs checkpointing)
+ * 2) some buffer is not part of the committing transaction
+ * 3) some buffer already has b_next_transaction set
+ */
+
+static int ext4_journalled_writepage_callback(struct page *page,
+					      struct writeback_control *wbc,
+					      void *data)
+{
+	transaction_t *transaction = (transaction_t *) data;
+	struct buffer_head *bh, *head;
+	struct journal_head *jh;
+
+	bh = head = page_buffers(page);
+	do {
+		jh = bh2jh(bh);
+		if (buffer_dirty(bh) ||
+			(jh && (jh->b_transaction != transaction ||
+				jh->b_next_transaction))) {
+			redirty_page_for_writepage(wbc, page);
+			goto out;
+		}
+	} while ((bh = bh->b_this_page) != head);
+
+out:
+	return AOP_WRITEPAGE_ACTIVATE;
+}
+
+static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	transaction_t *transaction = jinode->i_transaction;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
+
+	struct writeback_control wbc = {
+		.sync_mode =  WB_SYNC_ALL,
+		.nr_to_write = ~0ULL,
+		.range_start = dirty_start,
+		.range_end = dirty_end,
+        };
+
+	return write_cache_pages(mapping, &wbc,
+				 ext4_journalled_writepage_callback,
+				 transaction);
+}
+
 static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 {
-	return jbd2_journal_submit_inode_data_buffers(jinode);
+	int ret;
+
+	if (ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = ext4_journalled_submit_inode_data_buffers(jinode);
+	else
+		ret = jbd2_journal_submit_inode_data_buffers(jinode);
+
+	return ret;
 }
 
 static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
 {
-	return jbd2_journal_finish_inode_data_buffers(jinode);
+	int ret = 0;
+
+	if (!ext4_should_journal_data(jinode->i_vfs_inode))
+		ret = jbd2_journal_finish_inode_data_buffers(jinode);
+
+	return ret;
 }
 
 static bool system_going_down(void)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:15 ` Jan Kara
  3 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:15 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira

Hi Mauricio!

On Thu 10-09-20 16:31:24, Mauricio Faria de Oliveira wrote:
> This series implements your suggestions for the RFC PATCH v2 set,
> which we mostly talked about in cover letter [1] and PATCH 3 [2].
> (I added Suggested-by: tags, by the way, for due credit.)
> 
> It looks almost good on fstests: zero regressions on data=ordered,
> and only one regression in data=journal (generic/347); I'll check.
> (That's with ext4; I'll check ocfs2, but it's only a minor change.)

Glad to hear that!

> Testing with 'stress-ng --mmap <N> --mmap-file' runs well for days,
> but it occasionally hits:
> 
>   JBD2: Spotted dirty metadata buffer (dev = vdc, blocknr = 74775).
>   There's a risk of filesystem corruption in case of system crash.
> 
> I added dump_stack() in warn_dirty_buffer(), and it usually comes
> from jbd2_journal_file_buffer(, BJ_Forget) in the commit function.
> When filing from BJ_Shadow to BJ_Forget.. so it possibly happened
> while BH_Shadow was still set!
> 
> So I instrumented [test_]set_buffer_dirty() macros to dump_stack()
> if BH_Shadow is set (i.e. buffer being set dirty during write-out.)
> 
> This showed that the occasional BH_Dirty setter with BH_Shadow set
> is block_page_mkwrite() in ext4_page_mkwrite(). And it seems right,
> because it's called before do_journal_get_write_access() (where we
> check for/wait on BH_Shadow.)
> 
> ext4_page_mkwrite():
> 
>         err = block_page_mkwrite(vma, vmf, get_block);
>         if (!err && ext4_should_journal_data(inode)) {
>                 if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
>                           PAGE_SIZE, NULL, do_journal_get_write_access)) {
> 
> The patches didn't directly break this, they only allow this code
> to run more often as we disabled an early-return optimization for
> the case 'all buffers mapped' in data=journal (question 2 in [1]):
> 
> ext4_page_mkwrite():
> 
>          * Return if we have all the buffers mapped.
> 	...
> -       if (page_has_buffers(page)) {
> +       if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
> 
> 
> In order to confirm it, I built the unpatched v5.9-rc4 kernel, with
> just the change to disable that optimization in data=journal -- and
> it hit that occasionally too ("JBD2: Spotted dirty metadata buffer".)
> 
> I was naive enough to mindlessly try to swap the order of those two
> statements, in hopes that do_journal_get_write_access() should wait
> for BH_Shadow to clear, and then we just call block_page_mkwrite().
> BUT it trips into the BUG() check in page_buffers() in the former.

Yeah, you're right that code is wrong for data=journal case. We cannot
really use block_page_mkwrite() for it - we rather need to use there
something like:

	__block_write_begin(page, pos, len, ext4_get_block);
	ext4_walk_page_buffers(handle, page_buffers(page),
                                             from, to, NULL,
                                             do_journal_get_write_access);
	ext4_walk_page_buffers(handle, page_buffers(page), from, to, NULL,
				write_end_fn);

or something like that, I guess you get the idea...

Actually, I think data=journal mode should get it's own .page_mkwrite
handler because the sharing between data=journal handling and the other
cases is pretty minimal.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
  2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:19   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:19 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira

On Thu 10-09-20 16:31:25, Mauricio Faria de Oliveira wrote:
> Export functions that implement the current behavior done
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> No functional change.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/jbd2/commit.c     | 32 +++++++++++++++++---------------
>  fs/jbd2/journal.c    |  2 ++
>  include/linux/jbd2.h |  4 ++++
>  3 files changed, 23 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 6d2da8ad0e6f..c17cda96926e 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -187,9 +187,11 @@ static int journal_wait_on_commit_record(journal_t *journal,
>   * use writepages() because with delayed allocation we may be doing
>   * block allocation in writepages().
>   */
> -static int journal_submit_inode_data_buffers(struct address_space *mapping,
> -		loff_t dirty_start, loff_t dirty_end)
> +int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  {
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = jinode->i_dirty_start;
> +	loff_t dirty_end = jinode->i_dirty_end;
>  	int ret;
>  	struct writeback_control wbc = {
>  		.sync_mode =  WB_SYNC_ALL,
> @@ -215,16 +217,11 @@ static int journal_submit_data_buffers(journal_t *journal,
>  {
>  	struct jbd2_inode *jinode;
>  	int err, ret = 0;
> -	struct address_space *mapping;
>  
>  	spin_lock(&journal->j_list_lock);
>  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> -		loff_t dirty_start = jinode->i_dirty_start;
> -		loff_t dirty_end = jinode->i_dirty_end;
> -
>  		if (!(jinode->i_flags & JI_WRITE_DATA))
>  			continue;
> -		mapping = jinode->i_vfs_inode->i_mapping;
>  		jinode->i_flags |= JI_COMMIT_RUNNING;
>  		spin_unlock(&journal->j_list_lock);
>  		/*
> @@ -234,8 +231,7 @@ static int journal_submit_data_buffers(journal_t *journal,
>  		 * only allocated blocks here.
>  		 */
>  		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> -		err = journal_submit_inode_data_buffers(mapping, dirty_start,
> -				dirty_end);
> +		err = jbd2_journal_submit_inode_data_buffers(jinode);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> @@ -248,6 +244,17 @@ static int journal_submit_data_buffers(journal_t *journal,
>  	return ret;
>  }
>  
> +int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	loff_t dirty_start = jinode->i_dirty_start;
> +	loff_t dirty_end = jinode->i_dirty_end;
> +	int ret;
> +
> +	ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
> +	return ret;
> +}
> +
>  /*
>   * Wait for data submitted for writeout, refile inodes to proper
>   * transaction if needed.
> @@ -262,16 +269,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>  	/* For locking, see the comment in journal_submit_data_buffers() */
>  	spin_lock(&journal->j_list_lock);
>  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> -		loff_t dirty_start = jinode->i_dirty_start;
> -		loff_t dirty_end = jinode->i_dirty_end;
> -
>  		if (!(jinode->i_flags & JI_WAIT_DATA))
>  			continue;
>  		jinode->i_flags |= JI_COMMIT_RUNNING;
>  		spin_unlock(&journal->j_list_lock);
> -		err = filemap_fdatawait_range_keep_errors(
> -				jinode->i_vfs_inode->i_mapping, dirty_start,
> -				dirty_end);
> +		err = jbd2_journal_finish_inode_data_buffers(jinode);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 17fdc482f554..c0600405e7a2 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
>  EXPORT_SYMBOL(jbd2_journal_force_commit);
>  EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
>  EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
> +EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
> +EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
>  EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
>  EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
>  EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 08f904943ab2..2865a5475888 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1421,6 +1421,10 @@ extern int	   jbd2_journal_inode_ranged_write(handle_t *handle,
>  extern int	   jbd2_journal_inode_ranged_wait(handle_t *handle,
>  			struct jbd2_inode *inode, loff_t start_byte,
>  			loff_t length);
> +extern int	   jbd2_journal_submit_inode_data_buffers(
> +			struct jbd2_inode *jinode);
> +extern int	   jbd2_journal_finish_inode_data_buffers(
> +			struct jbd2_inode *jinode);
>  extern int	   jbd2_journal_begin_ordered_truncate(journal_t *journal,
>  				struct jbd2_inode *inode, loff_t new_size);
>  extern void	   jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:22   ` Jan Kara
  2020-09-16 16:52     ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:22 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira

On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> Introduce journal callbacks to allow different behaviors
> for an inode in journal_submit|finish_inode_data_buffers().
> 
> The existing users of the current behavior (ext4, ocfs2)
> are adapted to use the previously exported functions
> that implement the current behavior.
> 
> Users are callers of jbd2_journal_inode_ranged_write|wait(),
> which adds the inode to the transaction's inode list with
> the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/super.c      | 14 ++++++++++++++
>  fs/jbd2/commit.c     | 30 ++++++++++++++++++------------
>  fs/ocfs2/super.c     | 15 +++++++++++++++
>  include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
>  4 files changed, 71 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ea425b49b345..7303839d7ad9 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
>  	spin_unlock(&sbi->s_md_lock);
>  }
>  
> +static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	return jbd2_journal_submit_inode_data_buffers(jinode);
> +}
> +
> +static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	return jbd2_journal_finish_inode_data_buffers(jinode);
> +}
> +

No need for these ext4 wrappers. They just obfuscate code... Ditto for
ocfs2 below.

> @@ -1111,6 +1113,27 @@ struct journal_s
>  	void			(*j_commit_callback)(journal_t *,
>  						     transaction_t *);
>  
> +	/**
> +	 * @j_submit_inode_data_buffers:
> +	 *
> +	 * This function is called for all inodes associated with the
> +	 * committing transaction marked with JI_WRITE_DATA flag
> +	 * before we start to write out the transaction to the journal.
> +	 */
> +	int			(*j_submit_inode_data_buffers)
> +					(struct jbd2_inode *);
> +
> +	/**
> +	 * @j_finish_inode_data_buffers:
> +	 *
> +	 * This function is called for all inodes associated with the
> +	 * committing transaction marked with JI_WAIT_DATA flag
> +	 * after we have written the transaction to the journal
> +	 * but before we write out the commit block.
> +	 */
> +	int			(*j_finish_inode_data_buffers)
> +					(struct jbd2_inode *);
> +

Having the callbacks in the journal_s will not work if we have inodes with
data journalling on a filesystem mounted in data=ordered mode. The
callbacks really need to be a per-inode thing so I'd add them to struct
jbd2_inode.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
  2020-09-16 16:22   ` Jan Kara
@ 2020-09-16 16:52     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:52 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira

On Wed 16-09-20 18:22:40, Jan Kara wrote:
> On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> > @@ -1111,6 +1113,27 @@ struct journal_s
> >  	void			(*j_commit_callback)(journal_t *,
> >  						     transaction_t *);
> >  
> > +	/**
> > +	 * @j_submit_inode_data_buffers:
> > +	 *
> > +	 * This function is called for all inodes associated with the
> > +	 * committing transaction marked with JI_WRITE_DATA flag
> > +	 * before we start to write out the transaction to the journal.
> > +	 */
> > +	int			(*j_submit_inode_data_buffers)
> > +					(struct jbd2_inode *);
> > +
> > +	/**
> > +	 * @j_finish_inode_data_buffers:
> > +	 *
> > +	 * This function is called for all inodes associated with the
> > +	 * committing transaction marked with JI_WAIT_DATA flag
> > +	 * after we have written the transaction to the journal
> > +	 * but before we write out the commit block.
> > +	 */
> > +	int			(*j_finish_inode_data_buffers)
> > +					(struct jbd2_inode *);
> > +
> 
> Having the callbacks in the journal_s will not work if we have inodes with
> data journalling on a filesystem mounted in data=ordered mode. The
> callbacks really need to be a per-inode thing so I'd add them to struct
> jbd2_inode.

Oh, now I see that you properly handle this in the callbacks. So I retract
this objection. So feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

after removing those pointless wrappers.

								Hoonza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
  2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:59   ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:59 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira

On Thu 10-09-20 16:31:27, Mauricio Faria de Oliveira wrote:
> This implements journal callbacks j_submit|finish_inode_data_buffers()
> with different behavior for data=journal: to write-protect pages under
> commit, preventing changes to buffers writeably mapped to userspace.
> 
> If a buffer's content changes between commit's checksum calculation
> and write-out to disk, it can cause journal recovery/mount failures
> upon a kernel crash or power loss.
> 
>     [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
>     [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
>     [   27.342716] JBD2: recovery failed
>     [   27.343316] EXT4-fs (loop0): error loading journal
>     mount: /ext4: can't read superblock on /dev/loop0.
> 
> In j_submit_inode_data_buffers() we write-protect the inode's pages
> with write_cache_pages() and redirty w/ writepage callback if needed.
> 
> In j_finish_inode_data_buffers() there is nothing do to.
> 
> And in order to use the callbacks, inodes are added to the inode list
> in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
> 
> In ext4_page_mkwrite() we must make sure that:
> 
> 1) the inode is always added to the list;
>    thus we skip the 'all buffers mapped' optimization on data=journal;
> 
> 2) the buffers are attached to transaction as dirty;
>    as already done in __ext4_journalled_writepage().
> 
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reported-by: Dann Frazier <dann.frazier@canonical.com>
> ---
>  fs/ext4/inode.c | 29 ++++++++++++++------
>  fs/ext4/super.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 91 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf596467c234..fa4109da056c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
>  		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
>  					     write_end_fn);
>  	}
> +	if (ret == 0)
> +		ret = err;
> +	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
>  	if (ret == 0)
>  		ret = err;
>  	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> @@ -6004,9 +6007,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  		len = PAGE_SIZE;
>  	/*
>  	 * Return if we have all the buffers mapped. This avoids the need to do
> -	 * journal_start/journal_stop which can block and take a long time
> +	 * journal_start/journal_stop which can block and take a long time.
> +	 *
> +	 * This cannot be done for data journalling, as we have to add the
> +	 * inode to the transaction's list to writeprotect pages on commit.
>  	 */
> -	if (page_has_buffers(page)) {
> +	if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
>  		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
>  					    0, len, NULL,
>  					    ext4_bh_unmapped)) {
> @@ -6032,12 +6038,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	err = block_page_mkwrite(vma, vmf, get_block);
>  	if (!err && ext4_should_journal_data(inode)) {
>  		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> -			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
> -			unlock_page(page);
> -			ret = VM_FAULT_SIGBUS;
> -			ext4_journal_stop(handle);
> -			goto out;
> -		}
> +			PAGE_SIZE, NULL, do_journal_get_write_access))
> +			goto out_err;
> +		/* Make sure buffers are attached to the transaction as dirty */
> +		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> +			PAGE_SIZE, NULL, write_end_fn))
> +			goto out_err;

I think here we need to be a bit more careful. As I wrote in my answer to
cover letter we cannot call block_page_mkwrite(). Instead we need to lock
the page, compute 'len' again from i_size, then call __block_write_begin()
to map (or allocate) and read blocks, and then ext4_walk_page_buffers()
which needs to walk from 0 to len. And then unlock the page.

> +		if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE))
> +			goto out_err;
>  		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
>  	}
>  	ext4_journal_stop(handle);
> @@ -6049,6 +6057,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	up_read(&EXT4_I(inode)->i_mmap_sem);
>  	sb_end_pagefault(inode->i_sb);
>  	return ret;
> +out_err:
> +	unlock_page(page);
> +	ret = VM_FAULT_SIGBUS;
> +	ext4_journal_stop(handle);
> +	goto out;
>  }

Otherwise the patch looks good to me!

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-09-16 20:57 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-16 16:19   ` Jan Kara
2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-16 16:22   ` Jan Kara
2020-09-16 16:52     ` Jan Kara
2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-16 16:59   ` Jan Kara
2020-09-16 16:15 ` [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).