* [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
2020-09-16 16:19 ` Jan Kara
2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira
Export functions that implement the current behavior done
for an inode in journal_submit|finish_inode_data_buffers().
No functional change.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
fs/jbd2/commit.c | 32 +++++++++++++++++---------------
fs/jbd2/journal.c | 2 ++
include/linux/jbd2.h | 4 ++++
3 files changed, 23 insertions(+), 15 deletions(-)
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..c17cda96926e 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -187,9 +187,11 @@ static int journal_wait_on_commit_record(journal_t *journal,
* use writepages() because with delayed allocation we may be doing
* block allocation in writepages().
*/
-static int journal_submit_inode_data_buffers(struct address_space *mapping,
- loff_t dirty_start, loff_t dirty_end)
+int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
{
+ struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+ loff_t dirty_start = jinode->i_dirty_start;
+ loff_t dirty_end = jinode->i_dirty_end;
int ret;
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
@@ -215,16 +217,11 @@ static int journal_submit_data_buffers(journal_t *journal,
{
struct jbd2_inode *jinode;
int err, ret = 0;
- struct address_space *mapping;
spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
- loff_t dirty_start = jinode->i_dirty_start;
- loff_t dirty_end = jinode->i_dirty_end;
-
if (!(jinode->i_flags & JI_WRITE_DATA))
continue;
- mapping = jinode->i_vfs_inode->i_mapping;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
/*
@@ -234,8 +231,7 @@ static int journal_submit_data_buffers(journal_t *journal,
* only allocated blocks here.
*/
trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
- err = journal_submit_inode_data_buffers(mapping, dirty_start,
- dirty_end);
+ err = jbd2_journal_submit_inode_data_buffers(jinode);
if (!ret)
ret = err;
spin_lock(&journal->j_list_lock);
@@ -248,6 +244,17 @@ static int journal_submit_data_buffers(journal_t *journal,
return ret;
}
+int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+ loff_t dirty_start = jinode->i_dirty_start;
+ loff_t dirty_end = jinode->i_dirty_end;
+ int ret;
+
+ ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
+ return ret;
+}
+
/*
* Wait for data submitted for writeout, refile inodes to proper
* transaction if needed.
@@ -262,16 +269,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
/* For locking, see the comment in journal_submit_data_buffers() */
spin_lock(&journal->j_list_lock);
list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
- loff_t dirty_start = jinode->i_dirty_start;
- loff_t dirty_end = jinode->i_dirty_end;
-
if (!(jinode->i_flags & JI_WAIT_DATA))
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
- err = filemap_fdatawait_range_keep_errors(
- jinode->i_vfs_inode->i_mapping, dirty_start,
- dirty_end);
+ err = jbd2_journal_finish_inode_data_buffers(jinode);
if (!ret)
ret = err;
spin_lock(&journal->j_list_lock);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..c0600405e7a2 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
EXPORT_SYMBOL(jbd2_journal_force_commit);
EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
+EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
+EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 08f904943ab2..2865a5475888 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1421,6 +1421,10 @@ extern int jbd2_journal_inode_ranged_write(handle_t *handle,
extern int jbd2_journal_inode_ranged_wait(handle_t *handle,
struct jbd2_inode *inode, loff_t start_byte,
loff_t length);
+extern int jbd2_journal_submit_inode_data_buffers(
+ struct jbd2_inode *jinode);
+extern int jbd2_journal_finish_inode_data_buffers(
+ struct jbd2_inode *jinode);
extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
struct jbd2_inode *inode, loff_t new_size);
extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:19 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:19 UTC (permalink / raw)
To: Mauricio Faria de Oliveira
Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira
On Thu 10-09-20 16:31:25, Mauricio Faria de Oliveira wrote:
> Export functions that implement the current behavior done
> for an inode in journal_submit|finish_inode_data_buffers().
>
> No functional change.
>
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> fs/jbd2/commit.c | 32 +++++++++++++++++---------------
> fs/jbd2/journal.c | 2 ++
> include/linux/jbd2.h | 4 ++++
> 3 files changed, 23 insertions(+), 15 deletions(-)
>
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 6d2da8ad0e6f..c17cda96926e 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -187,9 +187,11 @@ static int journal_wait_on_commit_record(journal_t *journal,
> * use writepages() because with delayed allocation we may be doing
> * block allocation in writepages().
> */
> -static int journal_submit_inode_data_buffers(struct address_space *mapping,
> - loff_t dirty_start, loff_t dirty_end)
> +int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> {
> + struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> + loff_t dirty_start = jinode->i_dirty_start;
> + loff_t dirty_end = jinode->i_dirty_end;
> int ret;
> struct writeback_control wbc = {
> .sync_mode = WB_SYNC_ALL,
> @@ -215,16 +217,11 @@ static int journal_submit_data_buffers(journal_t *journal,
> {
> struct jbd2_inode *jinode;
> int err, ret = 0;
> - struct address_space *mapping;
>
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> - loff_t dirty_start = jinode->i_dirty_start;
> - loff_t dirty_end = jinode->i_dirty_end;
> -
> if (!(jinode->i_flags & JI_WRITE_DATA))
> continue;
> - mapping = jinode->i_vfs_inode->i_mapping;
> jinode->i_flags |= JI_COMMIT_RUNNING;
> spin_unlock(&journal->j_list_lock);
> /*
> @@ -234,8 +231,7 @@ static int journal_submit_data_buffers(journal_t *journal,
> * only allocated blocks here.
> */
> trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> - err = journal_submit_inode_data_buffers(mapping, dirty_start,
> - dirty_end);
> + err = jbd2_journal_submit_inode_data_buffers(jinode);
> if (!ret)
> ret = err;
> spin_lock(&journal->j_list_lock);
> @@ -248,6 +244,17 @@ static int journal_submit_data_buffers(journal_t *journal,
> return ret;
> }
>
> +int jbd2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> + struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> + loff_t dirty_start = jinode->i_dirty_start;
> + loff_t dirty_end = jinode->i_dirty_end;
> + int ret;
> +
> + ret = filemap_fdatawait_range_keep_errors(mapping, dirty_start, dirty_end);
> + return ret;
> +}
> +
> /*
> * Wait for data submitted for writeout, refile inodes to proper
> * transaction if needed.
> @@ -262,16 +269,11 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> /* For locking, see the comment in journal_submit_data_buffers() */
> spin_lock(&journal->j_list_lock);
> list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> - loff_t dirty_start = jinode->i_dirty_start;
> - loff_t dirty_end = jinode->i_dirty_end;
> -
> if (!(jinode->i_flags & JI_WAIT_DATA))
> continue;
> jinode->i_flags |= JI_COMMIT_RUNNING;
> spin_unlock(&journal->j_list_lock);
> - err = filemap_fdatawait_range_keep_errors(
> - jinode->i_vfs_inode->i_mapping, dirty_start,
> - dirty_end);
> + err = jbd2_journal_finish_inode_data_buffers(jinode);
> if (!ret)
> ret = err;
> spin_lock(&journal->j_list_lock);
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 17fdc482f554..c0600405e7a2 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -91,6 +91,8 @@ EXPORT_SYMBOL(jbd2_journal_try_to_free_buffers);
> EXPORT_SYMBOL(jbd2_journal_force_commit);
> EXPORT_SYMBOL(jbd2_journal_inode_ranged_write);
> EXPORT_SYMBOL(jbd2_journal_inode_ranged_wait);
> +EXPORT_SYMBOL(jbd2_journal_submit_inode_data_buffers);
> +EXPORT_SYMBOL(jbd2_journal_finish_inode_data_buffers);
> EXPORT_SYMBOL(jbd2_journal_init_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_release_jbd_inode);
> EXPORT_SYMBOL(jbd2_journal_begin_ordered_truncate);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 08f904943ab2..2865a5475888 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1421,6 +1421,10 @@ extern int jbd2_journal_inode_ranged_write(handle_t *handle,
> extern int jbd2_journal_inode_ranged_wait(handle_t *handle,
> struct jbd2_inode *inode, loff_t start_byte,
> loff_t length);
> +extern int jbd2_journal_submit_inode_data_buffers(
> + struct jbd2_inode *jinode);
> +extern int jbd2_journal_finish_inode_data_buffers(
> + struct jbd2_inode *jinode);
> extern int jbd2_journal_begin_ordered_truncate(journal_t *journal,
> struct jbd2_inode *inode, loff_t new_size);
> extern void jbd2_journal_init_jbd_inode(struct jbd2_inode *jinode, struct inode *inode);
> --
> 2.17.1
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
2020-09-16 16:22 ` Jan Kara
2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-16 16:15 ` [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira
Introduce journal callbacks to allow different behaviors
for an inode in journal_submit|finish_inode_data_buffers().
The existing users of the current behavior (ext4, ocfs2)
are adapted to use the previously exported functions
that implement the current behavior.
Users are callers of jbd2_journal_inode_ranged_write|wait(),
which adds the inode to the transaction's inode list with
the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
---
fs/ext4/super.c | 14 ++++++++++++++
fs/jbd2/commit.c | 30 ++++++++++++++++++------------
fs/ocfs2/super.c | 15 +++++++++++++++
include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
4 files changed, 71 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ea425b49b345..7303839d7ad9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
spin_unlock(&sbi->s_md_lock);
}
+static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
static bool system_going_down(void)
{
return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4646,6 +4656,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
sbi->s_journal->j_commit_callback = ext4_journal_commit_callback;
+ sbi->s_journal->j_submit_inode_data_buffers =
+ ext4_journal_submit_inode_data_buffers;
+ sbi->s_journal->j_finish_inode_data_buffers =
+ ext4_journal_finish_inode_data_buffers;
no_journal:
if (!test_opt(sb, NO_MBCACHE)) {
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index c17cda96926e..23d3fcc11b97 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -200,6 +200,12 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
.range_end = dirty_end,
};
+ /*
+ * submit the inode data buffers. We use writepage
+ * instead of writepages. Because writepages can do
+ * block allocation with delalloc. We need to write
+ * only allocated blocks here.
+ */
ret = generic_writepages(mapping, &wbc);
return ret;
}
@@ -224,16 +230,13 @@ static int journal_submit_data_buffers(journal_t *journal,
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
- /*
- * submit the inode data buffers. We use writepage
- * instead of writepages. Because writepages can do
- * block allocation with delalloc. We need to write
- * only allocated blocks here.
- */
+ /* submit the inode data buffers. */
trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
- err = jbd2_journal_submit_inode_data_buffers(jinode);
- if (!ret)
- ret = err;
+ if (journal->j_submit_inode_data_buffers) {
+ err = journal->j_submit_inode_data_buffers(jinode);
+ if (!ret)
+ ret = err;
+ }
spin_lock(&journal->j_list_lock);
J_ASSERT(jinode->i_transaction == commit_transaction);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
@@ -273,9 +276,12 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
continue;
jinode->i_flags |= JI_COMMIT_RUNNING;
spin_unlock(&journal->j_list_lock);
- err = jbd2_journal_finish_inode_data_buffers(jinode);
- if (!ret)
- ret = err;
+ /* wait for the inode data buffers writeout. */
+ if (journal->j_finish_inode_data_buffers) {
+ err = journal->j_finish_inode_data_buffers(jinode);
+ if (!ret)
+ ret = err;
+ }
spin_lock(&journal->j_list_lock);
jinode->i_flags &= ~JI_COMMIT_RUNNING;
smp_mb();
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 1d91dd1e8711..f4e62aafc89c 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -2010,6 +2010,16 @@ static int ocfs2_journal_addressable(struct ocfs2_super *osb)
return status;
}
+static int ocfs2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_submit_inode_data_buffers(jinode);
+}
+
+static int ocfs2_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ return jbd2_journal_finish_inode_data_buffers(jinode);
+}
+
static int ocfs2_initialize_super(struct super_block *sb,
struct buffer_head *bh,
int sector_size,
@@ -2211,6 +2221,11 @@ static int ocfs2_initialize_super(struct super_block *sb,
}
osb->journal = journal;
journal->j_osb = osb;
+ journal->j_journal->j_submit_inode_data_buffers =
+ ocfs2_journal_submit_inode_data_buffers;
+ journal->j_journal->j_finish_inode_data_buffers =
+ ocfs2_journal_finish_inode_data_buffers;
+
atomic_set(&journal->j_num_trans, 0);
init_rwsem(&journal->j_trans_barrier);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 2865a5475888..4aaa408c0ca7 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -629,7 +629,9 @@ struct transaction_s
struct journal_head *t_shadow_list;
/*
- * List of inodes whose data we've modified in data=ordered mode.
+ * List of inodes associated with the transaction; e.g., ext4 uses
+ * this to track inodes in data=ordered and data=journal mode that
+ * need special handling on transaction commit; also used by ocfs2.
* [j_list_lock]
*/
struct list_head t_inode_list;
@@ -1111,6 +1113,27 @@ struct journal_s
void (*j_commit_callback)(journal_t *,
transaction_t *);
+ /**
+ * @j_submit_inode_data_buffers:
+ *
+ * This function is called for all inodes associated with the
+ * committing transaction marked with JI_WRITE_DATA flag
+ * before we start to write out the transaction to the journal.
+ */
+ int (*j_submit_inode_data_buffers)
+ (struct jbd2_inode *);
+
+ /**
+ * @j_finish_inode_data_buffers:
+ *
+ * This function is called for all inodes associated with the
+ * committing transaction marked with JI_WAIT_DATA flag
+ * after we have written the transaction to the journal
+ * but before we write out the commit block.
+ */
+ int (*j_finish_inode_data_buffers)
+ (struct jbd2_inode *);
+
/*
* Journal statistics
*/
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:22 ` Jan Kara
2020-09-16 16:52 ` Jan Kara
0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:22 UTC (permalink / raw)
To: Mauricio Faria de Oliveira
Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira
On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> Introduce journal callbacks to allow different behaviors
> for an inode in journal_submit|finish_inode_data_buffers().
>
> The existing users of the current behavior (ext4, ocfs2)
> are adapted to use the previously exported functions
> that implement the current behavior.
>
> Users are callers of jbd2_journal_inode_ranged_write|wait(),
> which adds the inode to the transaction's inode list with
> the JI_WRITE|WAIT_DATA flags. Only ext4 and ocfs2 in-tree.
>
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> ---
> fs/ext4/super.c | 14 ++++++++++++++
> fs/jbd2/commit.c | 30 ++++++++++++++++++------------
> fs/ocfs2/super.c | 15 +++++++++++++++
> include/linux/jbd2.h | 25 ++++++++++++++++++++++++-
> 4 files changed, 71 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index ea425b49b345..7303839d7ad9 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -472,6 +472,16 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
> spin_unlock(&sbi->s_md_lock);
> }
>
> +static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> + return jbd2_journal_submit_inode_data_buffers(jinode);
> +}
> +
> +static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> + return jbd2_journal_finish_inode_data_buffers(jinode);
> +}
> +
No need for these ext4 wrappers. They just obfuscate code... Ditto for
ocfs2 below.
> @@ -1111,6 +1113,27 @@ struct journal_s
> void (*j_commit_callback)(journal_t *,
> transaction_t *);
>
> + /**
> + * @j_submit_inode_data_buffers:
> + *
> + * This function is called for all inodes associated with the
> + * committing transaction marked with JI_WRITE_DATA flag
> + * before we start to write out the transaction to the journal.
> + */
> + int (*j_submit_inode_data_buffers)
> + (struct jbd2_inode *);
> +
> + /**
> + * @j_finish_inode_data_buffers:
> + *
> + * This function is called for all inodes associated with the
> + * committing transaction marked with JI_WAIT_DATA flag
> + * after we have written the transaction to the journal
> + * but before we write out the commit block.
> + */
> + int (*j_finish_inode_data_buffers)
> + (struct jbd2_inode *);
> +
Having the callbacks in the journal_s will not work if we have inodes with
data journalling on a filesystem mounted in data=ordered mode. The
callbacks really need to be a per-inode thing so I'd add them to struct
jbd2_inode.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
2020-09-16 16:22 ` Jan Kara
@ 2020-09-16 16:52 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:52 UTC (permalink / raw)
To: Mauricio Faria de Oliveira
Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira
On Wed 16-09-20 18:22:40, Jan Kara wrote:
> On Thu 10-09-20 16:31:26, Mauricio Faria de Oliveira wrote:
> > @@ -1111,6 +1113,27 @@ struct journal_s
> > void (*j_commit_callback)(journal_t *,
> > transaction_t *);
> >
> > + /**
> > + * @j_submit_inode_data_buffers:
> > + *
> > + * This function is called for all inodes associated with the
> > + * committing transaction marked with JI_WRITE_DATA flag
> > + * before we start to write out the transaction to the journal.
> > + */
> > + int (*j_submit_inode_data_buffers)
> > + (struct jbd2_inode *);
> > +
> > + /**
> > + * @j_finish_inode_data_buffers:
> > + *
> > + * This function is called for all inodes associated with the
> > + * committing transaction marked with JI_WAIT_DATA flag
> > + * after we have written the transaction to the journal
> > + * but before we write out the commit block.
> > + */
> > + int (*j_finish_inode_data_buffers)
> > + (struct jbd2_inode *);
> > +
>
> Having the callbacks in the journal_s will not work if we have inodes with
> data journalling on a filesystem mounted in data=ordered mode. The
> callbacks really need to be a per-inode thing so I'd add them to struct
> jbd2_inode.
Oh, now I see that you properly handle this in the callbacks. So I retract
this objection. So feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
after removing those pointless wrappers.
Hoonza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-09-10 19:31 ` [RFC PATCH v3 1/3] jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
2020-09-10 19:31 ` [RFC PATCH v3 2/3] jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-10 19:31 ` Mauricio Faria de Oliveira
2020-09-16 16:59 ` Jan Kara
2020-09-16 16:15 ` [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
3 siblings, 1 reply; 9+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-09-10 19:31 UTC (permalink / raw)
To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira
This implements journal callbacks j_submit|finish_inode_data_buffers()
with different behavior for data=journal: to write-protect pages under
commit, preventing changes to buffers writeably mapped to userspace.
If a buffer's content changes between commit's checksum calculation
and write-out to disk, it can cause journal recovery/mount failures
upon a kernel crash or power loss.
[ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
[ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log
[ 27.342716] JBD2: recovery failed
[ 27.343316] EXT4-fs (loop0): error loading journal
mount: /ext4: can't read superblock on /dev/loop0.
In j_submit_inode_data_buffers() we write-protect the inode's pages
with write_cache_pages() and redirty w/ writepage callback if needed.
In j_finish_inode_data_buffers() there is nothing do to.
And in order to use the callbacks, inodes are added to the inode list
in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
In ext4_page_mkwrite() we must make sure that:
1) the inode is always added to the list;
thus we skip the 'all buffers mapped' optimization on data=journal;
2) the buffers are attached to transaction as dirty;
as already done in __ext4_journalled_writepage().
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Suggested-by: Jan Kara <jack@suse.cz>
Reported-by: Dann Frazier <dann.frazier@canonical.com>
---
fs/ext4/inode.c | 29 ++++++++++++++------
fs/ext4/super.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 91 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bf596467c234..fa4109da056c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
write_end_fn);
}
+ if (ret == 0)
+ ret = err;
+ err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
if (ret == 0)
ret = err;
EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
@@ -6004,9 +6007,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
len = PAGE_SIZE;
/*
* Return if we have all the buffers mapped. This avoids the need to do
- * journal_start/journal_stop which can block and take a long time
+ * journal_start/journal_stop which can block and take a long time.
+ *
+ * This cannot be done for data journalling, as we have to add the
+ * inode to the transaction's list to writeprotect pages on commit.
*/
- if (page_has_buffers(page)) {
+ if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
if (!ext4_walk_page_buffers(NULL, page_buffers(page),
0, len, NULL,
ext4_bh_unmapped)) {
@@ -6032,12 +6038,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
err = block_page_mkwrite(vma, vmf, get_block);
if (!err && ext4_should_journal_data(inode)) {
if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
- PAGE_SIZE, NULL, do_journal_get_write_access)) {
- unlock_page(page);
- ret = VM_FAULT_SIGBUS;
- ext4_journal_stop(handle);
- goto out;
- }
+ PAGE_SIZE, NULL, do_journal_get_write_access))
+ goto out_err;
+ /* Make sure buffers are attached to the transaction as dirty */
+ if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
+ PAGE_SIZE, NULL, write_end_fn))
+ goto out_err;
+ if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE))
+ goto out_err;
ext4_set_inode_state(inode, EXT4_STATE_JDATA);
}
ext4_journal_stop(handle);
@@ -6049,6 +6057,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);
return ret;
+out_err:
+ unlock_page(page);
+ ret = VM_FAULT_SIGBUS;
+ ext4_journal_stop(handle);
+ goto out;
}
vm_fault_t ext4_filemap_fault(struct vm_fault *vmf)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7303839d7ad9..528b5e20b71c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,14 +472,82 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
spin_unlock(&sbi->s_md_lock);
}
+/*
+ * This writepage callback for write_cache_pages()
+ * takes care of a few cases after page cleaning.
+ *
+ * write_cache_pages() already checks for dirty pages
+ * and calls clear_page_dirty_for_io(), which we want,
+ * to write protect the pages.
+ *
+ * However, we have to redirty a page in these cases:
+ * 1) some buffer is dirty (needs checkpointing)
+ * 2) some buffer is not part of the committing transaction
+ * 3) some buffer already has b_next_transaction set
+ */
+
+static int ext4_journalled_writepage_callback(struct page *page,
+ struct writeback_control *wbc,
+ void *data)
+{
+ transaction_t *transaction = (transaction_t *) data;
+ struct buffer_head *bh, *head;
+ struct journal_head *jh;
+
+ bh = head = page_buffers(page);
+ do {
+ jh = bh2jh(bh);
+ if (buffer_dirty(bh) ||
+ (jh && (jh->b_transaction != transaction ||
+ jh->b_next_transaction))) {
+ redirty_page_for_writepage(wbc, page);
+ goto out;
+ }
+ } while ((bh = bh->b_this_page) != head);
+
+out:
+ return AOP_WRITEPAGE_ACTIVATE;
+}
+
+static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+ struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+ transaction_t *transaction = jinode->i_transaction;
+ loff_t dirty_start = jinode->i_dirty_start;
+ loff_t dirty_end = jinode->i_dirty_end;
+
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = ~0ULL,
+ .range_start = dirty_start,
+ .range_end = dirty_end,
+ };
+
+ return write_cache_pages(mapping, &wbc,
+ ext4_journalled_writepage_callback,
+ transaction);
+}
+
static int ext4_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
{
- return jbd2_journal_submit_inode_data_buffers(jinode);
+ int ret;
+
+ if (ext4_should_journal_data(jinode->i_vfs_inode))
+ ret = ext4_journalled_submit_inode_data_buffers(jinode);
+ else
+ ret = jbd2_journal_submit_inode_data_buffers(jinode);
+
+ return ret;
}
static int ext4_journal_finish_inode_data_buffers(struct jbd2_inode *jinode)
{
- return jbd2_journal_finish_inode_data_buffers(jinode);
+ int ret = 0;
+
+ if (!ext4_should_journal_data(jinode->i_vfs_inode))
+ ret = jbd2_journal_finish_inode_data_buffers(jinode);
+
+ return ret;
}
static bool system_going_down(void)
--
2.17.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:59 ` Jan Kara
0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:59 UTC (permalink / raw)
To: Mauricio Faria de Oliveira
Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira
On Thu 10-09-20 16:31:27, Mauricio Faria de Oliveira wrote:
> This implements journal callbacks j_submit|finish_inode_data_buffers()
> with different behavior for data=journal: to write-protect pages under
> commit, preventing changes to buffers writeably mapped to userspace.
>
> If a buffer's content changes between commit's checksum calculation
> and write-out to disk, it can cause journal recovery/mount failures
> upon a kernel crash or power loss.
>
> [ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
> [ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log
> [ 27.342716] JBD2: recovery failed
> [ 27.343316] EXT4-fs (loop0): error loading journal
> mount: /ext4: can't read superblock on /dev/loop0.
>
> In j_submit_inode_data_buffers() we write-protect the inode's pages
> with write_cache_pages() and redirty w/ writepage callback if needed.
>
> In j_finish_inode_data_buffers() there is nothing do to.
>
> And in order to use the callbacks, inodes are added to the inode list
> in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite().
>
> In ext4_page_mkwrite() we must make sure that:
>
> 1) the inode is always added to the list;
> thus we skip the 'all buffers mapped' optimization on data=journal;
>
> 2) the buffers are attached to transaction as dirty;
> as already done in __ext4_journalled_writepage().
>
> Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
> Suggested-by: Jan Kara <jack@suse.cz>
> Reported-by: Dann Frazier <dann.frazier@canonical.com>
> ---
> fs/ext4/inode.c | 29 ++++++++++++++------
> fs/ext4/super.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 91 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bf596467c234..fa4109da056c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1910,6 +1910,9 @@ static int __ext4_journalled_writepage(struct page *page,
> err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
> write_end_fn);
> }
> + if (ret == 0)
> + ret = err;
> + err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
> if (ret == 0)
> ret = err;
> EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> @@ -6004,9 +6007,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
> len = PAGE_SIZE;
> /*
> * Return if we have all the buffers mapped. This avoids the need to do
> - * journal_start/journal_stop which can block and take a long time
> + * journal_start/journal_stop which can block and take a long time.
> + *
> + * This cannot be done for data journalling, as we have to add the
> + * inode to the transaction's list to writeprotect pages on commit.
> */
> - if (page_has_buffers(page)) {
> + if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
> if (!ext4_walk_page_buffers(NULL, page_buffers(page),
> 0, len, NULL,
> ext4_bh_unmapped)) {
> @@ -6032,12 +6038,14 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
> err = block_page_mkwrite(vma, vmf, get_block);
> if (!err && ext4_should_journal_data(inode)) {
> if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> - PAGE_SIZE, NULL, do_journal_get_write_access)) {
> - unlock_page(page);
> - ret = VM_FAULT_SIGBUS;
> - ext4_journal_stop(handle);
> - goto out;
> - }
> + PAGE_SIZE, NULL, do_journal_get_write_access))
> + goto out_err;
> + /* Make sure buffers are attached to the transaction as dirty */
> + if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> + PAGE_SIZE, NULL, write_end_fn))
> + goto out_err;
I think here we need to be a bit more careful. As I wrote in my answer to
cover letter we cannot call block_page_mkwrite(). Instead we need to lock
the page, compute 'len' again from i_size, then call __block_write_begin()
to map (or allocate) and read blocks, and then ext4_walk_page_buffers()
which needs to walk from 0 to len. And then unlock the page.
> + if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE))
> + goto out_err;
> ext4_set_inode_state(inode, EXT4_STATE_JDATA);
> }
> ext4_journal_stop(handle);
> @@ -6049,6 +6057,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
> up_read(&EXT4_I(inode)->i_mmap_sem);
> sb_end_pagefault(inode->i_sb);
> return ret;
> +out_err:
> + unlock_page(page);
> + ret = VM_FAULT_SIGBUS;
> + ext4_journal_stop(handle);
> + goto out;
> }
Otherwise the patch looks good to me!
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit
2020-09-10 19:31 [RFC PATCH v3 0/3] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
` (2 preceding siblings ...)
2020-09-10 19:31 ` [RFC PATCH v3 3/3] ext4: data=journal: write-protect pages on j_submit_inode_data_buffers() Mauricio Faria de Oliveira
@ 2020-09-16 16:15 ` Jan Kara
3 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2020-09-16 16:15 UTC (permalink / raw)
To: Mauricio Faria de Oliveira
Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira
Hi Mauricio!
On Thu 10-09-20 16:31:24, Mauricio Faria de Oliveira wrote:
> This series implements your suggestions for the RFC PATCH v2 set,
> which we mostly talked about in cover letter [1] and PATCH 3 [2].
> (I added Suggested-by: tags, by the way, for due credit.)
>
> It looks almost good on fstests: zero regressions on data=ordered,
> and only one regression in data=journal (generic/347); I'll check.
> (That's with ext4; I'll check ocfs2, but it's only a minor change.)
Glad to hear that!
> Testing with 'stress-ng --mmap <N> --mmap-file' runs well for days,
> but it occasionally hits:
>
> JBD2: Spotted dirty metadata buffer (dev = vdc, blocknr = 74775).
> There's a risk of filesystem corruption in case of system crash.
>
> I added dump_stack() in warn_dirty_buffer(), and it usually comes
> from jbd2_journal_file_buffer(, BJ_Forget) in the commit function.
> When filing from BJ_Shadow to BJ_Forget.. so it possibly happened
> while BH_Shadow was still set!
>
> So I instrumented [test_]set_buffer_dirty() macros to dump_stack()
> if BH_Shadow is set (i.e. buffer being set dirty during write-out.)
>
> This showed that the occasional BH_Dirty setter with BH_Shadow set
> is block_page_mkwrite() in ext4_page_mkwrite(). And it seems right,
> because it's called before do_journal_get_write_access() (where we
> check for/wait on BH_Shadow.)
>
> ext4_page_mkwrite():
>
> err = block_page_mkwrite(vma, vmf, get_block);
> if (!err && ext4_should_journal_data(inode)) {
> if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
> PAGE_SIZE, NULL, do_journal_get_write_access)) {
>
> The patches didn't directly break this, they only allow this code
> to run more often as we disabled an early-return optimization for
> the case 'all buffers mapped' in data=journal (question 2 in [1]):
>
> ext4_page_mkwrite():
>
> * Return if we have all the buffers mapped.
> ...
> - if (page_has_buffers(page)) {
> + if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
>
>
> In order to confirm it, I built the unpatched v5.9-rc4 kernel, with
> just the change to disable that optimization in data=journal -- and
> it hit that occasionally too ("JBD2: Spotted dirty metadata buffer".)
>
> I was naive enough to mindlessly try to swap the order of those two
> statements, in hopes that do_journal_get_write_access() should wait
> for BH_Shadow to clear, and then we just call block_page_mkwrite().
> BUT it trips into the BUG() check in page_buffers() in the former.
Yeah, you're right that code is wrong for data=journal case. We cannot
really use block_page_mkwrite() for it - we rather need to use there
something like:
__block_write_begin(page, pos, len, ext4_get_block);
ext4_walk_page_buffers(handle, page_buffers(page),
from, to, NULL,
do_journal_get_write_access);
ext4_walk_page_buffers(handle, page_buffers(page), from, to, NULL,
write_end_fn);
or something like that, I guess you get the idea...
Actually, I think data=journal mode should get it's own .page_mkwrite
handler because the sharing between data=journal handling and the other
cases is pretty minimal.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 9+ messages in thread