All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit
@ 2020-08-10  1:02 Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption Mauricio Faria de Oliveira
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

Hi Jan,

Upfront, thanks again for taking the time about a month ago to provide
your understanding of this problem and suggestions on how to implement
this fix [1].

I'd like to ask for your comments on the patchset so far, and have a few questions.

First, there is a test case to reproduce the issue:

Essentially, it enlarges the race window during journal commit, between
buffer checksum calculation and commit to disk, for userspace to modify
mmap()ed buffers, then panic afterwards.  Later, journal recovery fails
due to invalid checksum.

Patch 1 is the kernel hack to allow it.
The last 2 files in the series are the a setup script to create/mount
a loopdev w/ ext4 data=journal filesystem, and the C test case for it.

Test is essentially:

    fd = open("file");
    addr = mmap(fd);
    pwrite(fd, "a", 1, 0);
    addr[0] = 'a';
    press_enter(); // when asked for.
    addr[0] = 'b';

Usage:

    $ gcc -o test test.c
    $ ./ext4.sh && cd /ext4 && ~/test

    Press enter to change buffer contents
    # wait for periodic journal commit
    [   26.286197] TESTCASE: Cookie found. Waiting 5 seconds for changes.
    # now press enter

    Buffer contents changed
    Press enter to change buffer contents
    # just wait
    [   31.370558] TESTCASE: Cookie eaten. Resumed.
    [   31.375906] Kernel panic - not syncing: TESTCASE: checksum changed; commit record done; panic!
    [   31.379799] CPU: 1 PID: 649 Comm: jbd2/loop0-8 Not tainted 5.8.0.mfo.1 #111
    ...
    [   31.400643] Rebooting in 5 seconds..

    $ ./ext4.sh mount
    ...
    + sudo mount -o data=journal /dev/loop0 /ext4
    [   27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support!
    [   27.339492] JBD2: Invalid checksum recovering data block 8705 in log
    [   27.342716] JBD2: recovery failed
    [   27.343316] EXT4-fs (loop0): error loading journal
    mount: /ext4: can't read superblock on /dev/loop0.


Second, this is a recap of your solution implementation suggestion [1], with line breaks added for clarity:

    """
    So I'd suggest a somewhat different solution.

    Instead of trying to mark, when page is under writeback, do it the other way around
    and make jbd2 kick ext4 on transaction commit to writeprotect journalled pages.

    Then, because of do_journal_get_write_access() in ext4_page_mkwrite() we are sure pages
    cannot be writeably mapped into page tables until transaction commit finishes
    (or we'll copy data to commit to b_frozen_data).

    Now let me explain a bit more the "make jbd2 kick ext4 on transaction
    commit to writeprotect journalled pages" part.

    I think we could mostly reuse the data=ordered machinery for that.

    data=ordered machinery tracks with each running transaction also a list of inodes
    for which we need to flush data pages before doing commit of metadata into the journal.

    Now we could use the same mechanism for data=journal mode -
    we'd track in the inode list inodes with writeably mapped pages
    and on transaction commit we need to writeprotect these pages using clear_page_dirty_for_io().

    Probably the best place to add inode to this list is ext4_journalled_writepage().

    To do the commit handling we probably need to introduce callbacks that jbd2 will call
    instead of journal_submit_inode_data_buffers() in journal_submit_data_buffers()
    and instead of filemap_fdatawait_range_keep_errors() in journal_finish_inode_data_buffers().

    In data=ordered mode ext4 will just do what jbd2 does in its callback,

    when inode's data should be journalled, the callback will
    use write_cache_pages() to iterate and writeprotect all dirty pages.

    The writepage callback just returns AOP_WRITEPAGE_ACTIVATE,

    some care needs to be taken to redirty a page in the writepage callback
     A) if not all its underlying buffers are part of the committing transaction
     B) or if some buffer already has b_next_transaction set
        (since in that case the page was already dirtied also against the following transaction).

    But it should all be reasonably doable.
    """

[1] https://lore.kernel.org/linux-ext4/20200610132139.GG12551@quack2.suse.cz/

If I understood it correctly:
- Patch 2 implements the callback mechanism.
- Patch 3 implements the write-protect code.
- Patch 4 has changes to ext4_page_mkwrite() discussed below.
- Patch 5 introduces debug messages for RFC discussion only.

Third, my changes/observations/questions:

1) > Probably the best place to add inode to this list is ext4_journalled_writepage().

I think we also need it in ext4_page_mkwrite(), right?  See the test case, for example:

    fd = open("file");
    addr = mmap(fd);
    pwrite(fd, "a", 1, 0);
    addr[0] = 'a';
    press_enter(); // when asked for.
    addr[0] = 'b';

Here pwrite() led to ext4_write_begin() which added the inode to a transaction (not to the transaction's inode list).

Now the jbd2 thread commits the transaction, before any writeback work or msync() call.
So __ext4_journalled_writepage() didn't get called; the page is still writeably mapped.

And it could be changed / problem reproduced with the test case.
Added some debug messages and line breaks for clarity:

    $ cd /ext4 && ~/test
    <...>
    [   20.668889] TESTDBG: ext4_write_begin() :: journal started: inode ffff8f72eda5fca8, txn ffff8f72f5ea0300
    [   20.670593] TESTDBG: do_get_write_access() :: entry for bh: ffff8f72edae2d68, offset: 0000, page ffffe50288a0a200, inode: ffff8f72eda5fca8
    [   20.672018] TESTDBG: ext4_journalled_write_end() :: journal stopped: inode ffff8f72eda5fca8

    [   20.673769] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff8f72eda5fca8
    [   20.675016] TESTDBG: ext4_bh_tdbg() :: bh: ffff8f72edae2d68, data offset: 0000, page: ffffe50288a0a200, inode: ffff8f72eda5fca8
    [   20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8

    Press enter to change buffer contents
    [   26.250842] TESTDBG: journal_submit_data_buffers() :: entry for transaction: 0xffff8f72f5ea0300
    <...>
    [   26.283521] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff8f72edae2d68, offset: 0000, page: ffffe50288a0a200, inode: ffff8f72eda5fca8
    [   26.286197] TESTCASE: Cookie found. Waiting 5 seconds for changes.
    # Press enter

    Buffer contents changed
    Press enter to change buffer contents
    [   31.370558] TESTCASE: Cookie eaten. Resumed.
    [   31.375906] Kernel panic - not syncing: TESTCASE: checksum changed; commit record done; panic!

So, we should have ext4_page_mkwrite() add the inode to the transaction's inode list to fix this case, IIUIC.


2) Do not return early in ext4_page_mkwrite() 'if we have all the buffers mapped'?

There's this check in ext4_page_mkwrite():

        /*
         * Return if we have all the buffers mapped. This avoids the need to do
         * journal_start/journal_stop which can block and take a long time
         */

Corresponding to this line in the debug messages:

    [   20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8

That returns early, *before* do_journal_get_write_access(), which is needed in data=journal,
regardless of whether all buffers are mapped: we could exit ext4_page_mkwrite() then change
buffers under commit during the race window.
    
So, it seems we should ignore that for data=journal, which allows us to ext4_journal_start()
and get a transaction handle, used to add the inode to the transaction's inode list (per #1.)

This is Patch 6.

With the changes from 1) and 2), the inode is added to the transaction's inode list,
and the callback does write-protect the page correctly.

When trying to change the buffer contents later (pressing enter) there _is_ a wait in
ext4_page_mkwrite() as expected!  But it's first at the file_update_time() because it
calls do_get_write_access() for that inode, which is also in the transaction.

I presume that a commit can start after file_update_time() and before ext4_journal_start()
in ext4_page_mkwrite(), so we'd still have to keep these changes to ext4_page_mkwrite() ?

    $ cd /ext4 && ~/test
    <...>
    [  142.264078] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35fa90, offset: 0000, page ffffd06548c1c400, inode: ffff98e136849118

    [  142.266110] TESTDBG: ext4_write_begin() :: journal started: inode ffff98e12f2c0508, txn ffff98e12f1ec200
    [  142.268063] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35f6e8, offset: 0000, page ffffd06548ccc100, inode: ffff98e12f2c0508
    [  142.269728] TESTDBG: ext4_journalled_write_end() :: journal stopped: inode ffff98e12f2c0508

    [  142.271894] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff98e12f2c0508
    [  142.273498] TESTDBG: ext4_bh_tdbg() :: bh: ffff98e12f35f6e8, data offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
    [  142.276779] TESTDBG: ext4_page_mkwrite() :: before djgwa(), for inode ffff98e12f2c0508
    [  142.276781] TESTDBG: ext4_page_mkwrite() :: after  djgwa(), for inode ffff98e12f2c0508
    [  142.278273] TESTDBG: ext4_page_mkwrite() :: Added inode to txn list: inode ffff98e12f2c0508, txn = ffff98e12f1ec200, err = 0    
    Press enter to change buffer contents
    # Waiting...
    [  148.140148] TESTDBG: journal_submit_data_buffers() :: entry for transaction: 0xffff98e12f1ec200
    [  148.145770] TESTDBG: journal_submit_data_buffers() :: txn list has inode ffff98e12f2c0508 (write data flag: 0x2)
    [  148.148453] TESTDBG: ext4_journalled_submit_inode_data_buffers() :: entry for inode: ffff98e12f2c0508
    [  148.148462] TESTDBG: ext4_journalled_writepage_callback() :: entry for bh ffff98e12f35f6e8, page ffffd06548ccc100, inode: ffff98e12f2c0508
    [  148.150920] TESTDBG: ext4_journalled_writepage_callback() :: redirty for bh ffff98e12f35f680, jh, 0000000000000000, txn 0000000000000000, next_txn 0000000000000000
    <...>
    [  148.168741] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff98e12f35fa90, offset: 0000, page: ffffd06548c1c400, inode: ffff98e136849118
    [  148.170897] TESTDBG: jbd2_journal_write_metadata_buffer() :: copy out: done/need 0/0, bh: ffff98e12f35f6e8, offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
    [  148.173067] TESTCASE: Cookie found. Waiting 5 seconds for changes.
    # Press enter.
        
    [  149.553705] TESTDBG: ext4_page_mkwrite() :: entry for inode ffff98e12f2c0508
    [  149.553719] TESTDBG: do_get_write_access() :: entry for bh: ffff98e132905d00, offset: 0000, page ffffd06548c7f580, inode: ffff98e136849118
    [  153.259932] TESTCASE: Cookie eaten. Resumed.
    # Note that dgwa() above waited the 5 seconds to move on, but for another inode (ext4_page_mkwrite() -> file_update_time()), not the file's inode

    [  153.264679] TESTDBG: ext4_journalled_finish_inode_data_buffers() :: entry for inode: ffff98e12f2c0508
    [  153.264885] TESTDBG: do_get_write_access() :: entry for bh: ffff98e132905d00, offset: 0000, page ffffd06548c7f580, inode: ffff98e136849118
    [  153.267647] TESTDBG: ext4_bh_tdbg() :: bh: ffff98e12f35f6e8, data offset: 0000, page: ffffd06548ccc100, inode: ffff98e12f2c0508
    [  153.275150] TESTDBG: ext4_page_mkwrite() :: before djgwa(), for inode ffff98e12f2c0508
    [  153.275153] TESTDBG: do_get_write_access() :: entry for bh: ffff98e12f35f6e8, offset: 0000, page ffffd06548ccc100, inode: ffff98e12f2c0508
    [  153.277669] TESTDBG: ext4_page_mkwrite() :: after  djgwa(), for inode ffff98e12f2c0508
    Buffer contents changed
    Press enter to change buffer contents
    [  153.280907] TESTDBG: ext4_page_mkwrite() :: Added inode to txn list: inode ffff98e12f2c0508, txn = ffff98e135db5d00, err = 0


3) When checking to redirty the page in the writepage callback,
   does a buffer without a journal head means we should redirty
   the page? (for the reason it's not part of the committing txn)

For example, in this case, the writepage callback is called,
checks the first buffer head is part of the committing txn,
but the next one has no journal head, so it's not part of any
transaction (including the committing txn, of course) so it
does redirty the page.

    [  148.148462] TESTDBG: ext4_journalled_writepage_callback() :: entry for bh ffff98e12f35f6e8, page ffffd06548ccc100, inode: ffff98e12f2c0508
    [  148.150920] TESTDBG: ext4_journalled_writepage_callback() :: redirty for bh ffff98e12f35f680, jh, 0000000000000000, txn 0000000000000000, next_txn 0000000000000000

4) Should we clear the PageChecked bit?

We don't clear the PageChecked bit, thus later when the writeback
work kicks in, __ext4_journalled_writepage() adds the page to the
transaction's inode list again.

Since the page is clean, it just goes into the submit data buffers
callback, but write_cache_pages() returns before the writepage callback.

Should we try to provent that by, say, clearing the pagechecked bit
in case we don't have to redirty the page (in the writepage callback) ?

5) Inodes with inline data.  I haven't checked in detail yet, but
would appreciate if you have a brief explanation about them, and
if they need special handling since they apparently don't get
do_journal_get_write_access() called (eg, see __ext4_journalled_writepage()).

Thank you very much,
Mauricio

Mauricio Faria de Oliveira (5):
  jbd2: test case for ext4 data=journal/mmap() journal corruption
  jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
  ext4: data=journal: write-protect pages on submit inode data buffers
    callback
  ext4: data=journal: add inode to transaction inode list in
    ext4_page_mkwrite()
  ext4/jbd2: debugging messages

 fs/ext4/inode.c       | 42 ++++++++++++++++++++++--
 fs/ext4/super.c       | 75 +++++++++++++++++++++++++++++++++++++++++++
 fs/jbd2/commit.c      | 55 ++++++++++++++++++++++++++++---
 fs/jbd2/journal.c     |  5 +++
 fs/jbd2/transaction.c |  4 +++
 include/linux/jbd2.h  | 21 +++++++++++-
 6 files changed, 194 insertions(+), 8 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-18 14:38   ` Jan Kara
  2020-08-10  1:02 ` [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers Mauricio Faria de Oliveira
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

This checks during journal commit, right after calculating the
checksum of a buffer head, whether its contents match the 'BUG'
string (the cookie string in the test case userspace part.)

If so, it sleeps 5 seconds for such contents to change (i.e.,
so that the actual checksum changes from what was calculated.)

And if it changed, set a flag to panic after committing to disk.

Then, on filesystem remount/journal recovery there is an invalid
checksum error, and recovery fails:

  $ sudo mount -o data=journal,journal_checksum $DEV $MNT
  [ 100.832223] EXT4-fs: Warning: mounting with data=journal disables
  delayed allocation, dioread_nolock, and O_DIRECT support!
  [ 100.837488] JBD2: Invalid checksum recovering data block 8706 in log
  [ 100.842010] JBD2: recovery failed
  [ 100.843045] EXT4-fs (loop0): error loading journal
  mount: /ext4: can't read superblock on /dev/loop0.
---
 fs/jbd2/commit.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..51f713089e35 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -26,6 +26,11 @@
 #include <linux/bitops.h>
 #include <trace/events/jbd2.h>
 
+#include <linux/printk.h>
+#include <linux/delay.h>
+
+static journal_t *force_panic;
+
 /*
  * IO end handler for temporary buffer_heads handling writes to the journal.
  */
@@ -331,14 +336,35 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
 	__u32 csum32;
 	__be32 seq;
 
+	// For the testcase
+	__u32 csum32_later;
+	__u8 *bh_data;
+
 	if (!jbd2_journal_has_csum_v2or3(j))
 		return;
 
 	seq = cpu_to_be32(sequence);
 	addr = kmap_atomic(page);
 	csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
+	csum32_later = csum32; // Copy csum32 to check again later
 	csum32 = jbd2_chksum(j, csum32, addr + offset_in_page(bh->b_data),
 			     bh->b_size);
+
+	// Check for testcase cookie 'BUG' in the buffer_head data.
+	bh_data = addr + offset_in_page(bh->b_data);
+	if (bh_data[0] == 'B' &&
+	    bh_data[1] == 'U' &&
+	    bh_data[2] == 'G') {
+		pr_info("TESTCASE: Cookie found. Waiting 5 seconds for changes.\n");
+		msleep(5000);
+		pr_info("TESTCASE: Cookie eaten. Resumed.\n");
+	}
+
+	// Check the checksum again for changes/panic after commit.
+	csum32_later = jbd2_chksum(j, csum32_later, addr + offset_in_page(bh->b_data), bh->b_size);
+	if (csum32 != csum32_later)
+		force_panic = j;
+
 	kunmap_atomic(addr);
 
 	if (jbd2_has_feature_csum3(j))
@@ -885,6 +911,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 		blkdev_issue_flush(journal->j_dev, GFP_NOFS);
 	}
 
+	if (force_panic == journal)
+		panic("TESTCASE: checksum changed; commit record done; panic!\n");
+
 	if (err)
 		jbd2_journal_abort(journal, err);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-18 14:52   ` Jan Kara
  2020-08-10  1:02 ` [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback Mauricio Faria de Oliveira
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

Add the callbacks as opt-in to override the default behavior for
the transaction's inode list, instead of moving that code around.

This is important as not only ext4 uses the inode list: ocfs2 too,
via jbd2_journal_inode_ranged_write(), and maybe out-of-tree code.

To opt-out of the default behavior (i.e., to do nothing), one has
to opt-in with a no-op function.
---
 fs/jbd2/commit.c     | 21 ++++++++++++++++-----
 include/linux/jbd2.h | 21 ++++++++++++++++++++-
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 51f713089e35..b98d227b50d8 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -237,10 +237,14 @@ static int journal_submit_data_buffers(journal_t *journal,
 		 * instead of writepages. Because writepages can do
 		 * block allocation  with delalloc. We need to write
 		 * only allocated blocks here.
+		 * This can be overriden with a custom callback.
 		 */
 		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
-		err = journal_submit_inode_data_buffers(mapping, dirty_start,
-				dirty_end);
+		if (journal->j_submit_inode_data_buffers)
+			err = journal->j_submit_inode_data_buffers(jinode);
+		else
+			err = journal_submit_inode_data_buffers(mapping,
+					dirty_start, dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
@@ -274,9 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
 			continue;
 		jinode->i_flags |= JI_COMMIT_RUNNING;
 		spin_unlock(&journal->j_list_lock);
-		err = filemap_fdatawait_range_keep_errors(
-				jinode->i_vfs_inode->i_mapping, dirty_start,
-				dirty_end);
+		/*
+		 * Wait for the inode data buffers writeout.
+		 * This can be overriden with a custom callback.
+		 */
+		if (journal->j_finish_inode_data_buffers)
+			err = journal->j_finish_inode_data_buffers(jinode);
+		else
+			err = filemap_fdatawait_range_keep_errors(
+					jinode->i_vfs_inode->i_mapping,
+					dirty_start, dirty_end);
 		if (!ret)
 			ret = err;
 		spin_lock(&journal->j_list_lock);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index d56128df2aff..24efe88eda1b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -628,7 +628,8 @@ struct transaction_s
 	struct journal_head	*t_shadow_list;
 
 	/*
-	 * List of inodes whose data we've modified in data=ordered mode.
+	 * List of inodes whose data we've modified in data=ordered mode
+	 * or whose pages we should write-protect in data=journaled mode.
 	 * [j_list_lock]
 	 */
 	struct list_head	t_inode_list;
@@ -1110,6 +1111,24 @@ struct journal_s
 	void			(*j_commit_callback)(journal_t *,
 						     transaction_t *);
 
+	/**
+	 * @j_submit_inode_data_buffers:
+	 *
+	 * This function is called before flushing metadata buffers.
+	 * This overrides the default behavior (writeout data buffers.)
+	 */
+	int			(*j_submit_inode_data_buffers)
+					(struct jbd2_inode *);
+
+	/**
+	 * @j_finish_inode_data_buffers:
+	 *
+	 * This function is called after flushing metadata buffers.
+	 * This overrides the default behavior (wait writeout.)
+	 */
+	int			(*j_finish_inode_data_buffers)
+					(struct jbd2_inode *);
+
 	/*
 	 * Journal statistics
 	 */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-19  8:44   ` Jan Kara
  2020-08-10  1:02 ` [RFC PATCH v2 4/5] ext4: data=journal: add inode to transaction inode list in ext4_page_mkwrite() Mauricio Faria de Oliveira
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

This implements the journal's j_submit_inode_data_buffers() callback
to write-protect the inode's pages with write_cache_pages(), and use
a writepage callback to redirty pages with buffers that are not part
of the committing transaction or the next transaction.

And set a no-op function as j_finish_inode_data_buffers() callback
(nothing needed other than the write-protect above.)

Currently, the inode is added to the transaction's inode list in the
__ext4_journalled_writepage() function.
---
 fs/ext4/inode.c |  4 +++
 fs/ext4/super.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 10dd470876b3..978ccde8454f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1911,6 +1911,10 @@ static int __ext4_journalled_writepage(struct page *page,
 		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
 					     write_end_fn);
 	}
+	if (ret == 0)
+		ret = err;
+	// XXX: is this correct for inline data inodes?
+	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
 	if (ret == 0)
 		ret = err;
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 330957ed1f05..38aaac6572ea 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -472,6 +472,66 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
 	spin_unlock(&sbi->s_md_lock);
 }
 
+/*
+ * This writepage callback for write_cache_pages()
+ * takes care of a few cases after page cleaning.
+ *
+ * write_cache_pages() already checks for dirty pages
+ * and calls clear_page_dirty_for_io(), which we want,
+ * to write protect the pages.
+ *
+ * However, we have to redirty a page in two cases:
+ * 1) some buffer is not part of the committing transaction
+ * 2) some buffer already has b_next_transaction set
+ */
+
+static int ext4_journalled_writepage_callback(struct page *page,
+					      struct writeback_control *wbc,
+					      void *data)
+{
+	transaction_t *transaction = (transaction_t *) data;
+	struct buffer_head *bh, *head;
+	struct journal_head *jh;
+
+	// XXX: any chance of !bh here?
+	bh = head = page_buffers(page);
+	do {
+		jh = bh2jh(bh);
+		if (!jh || jh->b_transaction != transaction ||
+		    jh->b_next_transaction) {
+			redirty_page_for_writepage(wbc, page);
+			goto out;
+		}
+	} while ((bh = bh->b_this_page) != head);
+
+out:
+	return AOP_WRITEPAGE_ACTIVATE;
+}
+
+static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
+	transaction_t *transaction = jinode->i_transaction;
+	loff_t dirty_start = jinode->i_dirty_start;
+	loff_t dirty_end = jinode->i_dirty_end;
+
+	struct writeback_control wbc = {
+		.sync_mode =  WB_SYNC_ALL,
+		.nr_to_write = mapping->nrpages * 2,
+		.range_start = dirty_start,
+		.range_end = dirty_end,
+        };
+
+	return write_cache_pages(mapping, &wbc,
+				 ext4_journalled_writepage_callback,
+				 transaction);
+}
+
+static int ext4_journalled_finish_inode_data_buffers(struct jbd2_inode *jinode)
+{
+	return 0;
+}
+
 static bool system_going_down(void)
 {
 	return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
@@ -4599,6 +4659,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		ext4_msg(sb, KERN_ERR, "can't mount with "
 			"journal_async_commit in data=ordered mode");
 		goto failed_mount_wq;
+	} else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
+		sbi->s_journal->j_submit_inode_data_buffers =
+			ext4_journalled_submit_inode_data_buffers;
+		sbi->s_journal->j_finish_inode_data_buffers =
+			ext4_journalled_finish_inode_data_buffers;
 	}
 
 	set_task_ioprio(sbi->s_journal->j_task, journal_ioprio);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 4/5] ext4: data=journal: add inode to transaction inode list in ext4_page_mkwrite()
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (2 preceding siblings ...)
  2020-08-10  1:02 ` [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

Since we only add the inode to the transaction's inode list in
__ext4_journalled_writepage(), we depend on msync() or writeback work
(which call it) for the write-protect mechanism to work.

This test snippet shows that, as pwrite() gets the inode into a
transaction (!= than into transaction's inode list), and addr[]
write access gets the page writeably mapped.

    fd = open("file");
    addr = mmap(fd);
    pwrite(fd, "a", 1, 0); // journals inode via ext4_write_begin()
    addr[0] = 'a'; // page is writeably mapped to user space.
    // periodic journal commit / jbd2 thread runs now.
    // __ext4_journalled_writepage() was not called yet.

Now it's possible for a subsequent addr[] write access to race
with the commit function, and possibly hit the window to cause
invalid checksums.
---
 fs/ext4/inode.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 978ccde8454f..ce5464f92a7e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6008,9 +6008,10 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		len = PAGE_SIZE;
 	/*
 	 * Return if we have all the buffers mapped. This avoids the need to do
-	 * journal_start/journal_stop which can block and take a long time
+	 * journal_start/journal_stop which can block and take a long time. But
+	 * not on data journalling, as we have to add the inode to the txn list.
 	 */
-	if (page_has_buffers(page)) {
+	if (page_has_buffers(page) && !ext4_should_journal_data(inode)) {
 		if (!ext4_walk_page_buffers(NULL, page_buffers(page),
 					    0, len, NULL,
 					    ext4_bh_unmapped)) {
@@ -6043,6 +6044,12 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 			goto out;
 		}
 		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
+		if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE)) {
+			unlock_page(page);
+			ret = VM_FAULT_SIGBUS;
+			ext4_journal_stop(handle);
+			goto out;
+		}
 	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2 5/5] ext4/jbd2: debugging messages
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (3 preceding siblings ...)
  2020-08-10  1:02 ` [RFC PATCH v2 4/5] ext4: data=journal: add inode to transaction inode list in ext4_page_mkwrite() Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-10  1:48   ` kernel test robot
                     ` (2 more replies)
  2020-08-10  1:02 ` [RFC PATCH v2/SETUP SCRIPT] Mauricio Faria de Oliveira
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

For code tracking and deubgging purposes; some used in cover letter.

---
 fs/ext4/inode.c       | 27 +++++++++++++++++++++++++++
 fs/ext4/super.c       | 10 ++++++++++
 fs/jbd2/commit.c      |  5 +++++
 fs/jbd2/journal.c     |  5 +++++
 fs/jbd2/transaction.c |  4 ++++
 5 files changed, 51 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ce5464f92a7e..cd01aec87303 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -48,6 +48,17 @@
 
 #include <trace/events/ext4.h>
 
+#include "ext4_jbd2_dbg.h"
+
+static int ext4_bh_tdbg(handle_t *handle, struct buffer_head *bh)
+{
+	struct super_block *sb = bh->b_page->mapping->host->i_sb;
+	tdbg_ext4(sb, "bh: %px, data offset: %04llx, page: %px, inode: %px\n",
+		  bh, (u64) bh->b_data & ((u64)PAGE_SIZE - 1),
+		  bh->b_page, bh->b_page->mapping->host);
+	return 0;
+}
+
 static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
 			      struct ext4_inode_info *ei)
 {
@@ -1193,6 +1204,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
 		ret = __block_write_begin(page, pos, len, ext4_get_block);
 #endif
 	if (!ret && ext4_should_journal_data(inode)) {
+		tdbg_ext4(inode->i_sb, "journal started: inode %px, txn %px", inode, handle->h_transaction);
 		ret = ext4_walk_page_buffers(handle, page_buffers(page),
 					     from, to, NULL,
 					     do_journal_get_write_access);
@@ -1446,6 +1458,7 @@ static int ext4_journalled_write_end(struct file *file,
 			ext4_orphan_del(NULL, inode);
 	}
 
+	tdbg_ext4(inode->i_sb, "journal stopped: inode %px", inode);
 	return ret ? ret : copied;
 }
 
@@ -1917,6 +1930,10 @@ static int __ext4_journalled_writepage(struct page *page,
 	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
 	if (ret == 0)
 		ret = err;
+	{
+		struct super_block *sb = handle->h_transaction->t_journal->j_private;
+		tdbg_ext4(sb, "Added inode to txn list: inode %px, txn = %px, err = %d", inode, handle->h_transaction, err);
+	}
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
 	err = ext4_journal_stop(handle);
 	if (!ret)
@@ -2035,6 +2052,7 @@ static int ext4_writepage(struct page *page,
 		keep_towrite = true;
 	}
 
+	tdbg_ext4(inode->i_sb, "called for inode %px by comm %s", inode, current->comm);
 	if (PageChecked(page) && ext4_should_journal_data(inode))
 		/*
 		 * It's mmapped pagecache.  Add buffers and journal it.  There
@@ -5969,6 +5987,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	get_block_t *get_block;
 	int retries = 0;
 
+	tdbg_ext4(inode->i_sb, "entry for inode %px", inode);
+
 	if (unlikely(IS_IMMUTABLE(inode)))
 		return VM_FAULT_SIGBUS;
 
@@ -6006,6 +6026,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 		len = size & ~PAGE_MASK;
 	else
 		len = PAGE_SIZE;
+
+	ext4_walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_tdbg);
+
 	/*
 	 * Return if we have all the buffers mapped. This avoids the need to do
 	 * journal_start/journal_stop which can block and take a long time. But
@@ -6018,6 +6041,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 			/* Wait so that we don't change page under IO */
 			wait_for_stable_page(page);
 			ret = VM_FAULT_LOCKED;
+			tdbg_ext4(inode->i_sb, "returning; all buffers mapped for inode %px", inode);
 			goto out;
 		}
 	}
@@ -6036,6 +6060,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 	}
 	err = block_page_mkwrite(vma, vmf, get_block);
 	if (!err && ext4_should_journal_data(inode)) {
+		tdbg_ext4(inode->i_sb, "before djgwa(), for inode %px", inode);
 		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
 			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
 			unlock_page(page);
@@ -6043,6 +6068,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 			ext4_journal_stop(handle);
 			goto out;
 		}
+		tdbg_ext4(inode->i_sb, "after  djgwa(), for inode %px", inode);
 		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 		if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE)) {
 			unlock_page(page);
@@ -6050,6 +6076,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
 			ext4_journal_stop(handle);
 			goto out;
 		}
+		tdbg_ext4(inode->i_sb, "Added inode to txn list: inode %px, txn = %px, err = 0", inode, handle->h_transaction);
 	}
 	ext4_journal_stop(handle);
 	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 38aaac6572ea..7167fcf60b5c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -58,6 +58,8 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/ext4.h>
 
+#include "ext4_jbd2_dbg.h"
+
 static struct ext4_lazy_init *ext4_li_info;
 static struct mutex ext4_li_mtx;
 static struct ratelimit_state ext4_mount_msg_ratelimit;
@@ -492,14 +494,20 @@ static int ext4_journalled_writepage_callback(struct page *page,
 	transaction_t *transaction = (transaction_t *) data;
 	struct buffer_head *bh, *head;
 	struct journal_head *jh;
+	struct super_block *sb = page->mapping->host->i_sb;
 
 	// XXX: any chance of !bh here?
 	bh = head = page_buffers(page);
+	tdbg_ext4(sb, "entry for bh %px, page %px, inode: %px", bh, page, page->mapping->host);
 	do {
 		jh = bh2jh(bh);
 		if (!jh || jh->b_transaction != transaction ||
 		    jh->b_next_transaction) {
 			redirty_page_for_writepage(wbc, page);
+			tdbg_ext4(sb, "redirty for bh %px, jh, %px, txn %px, next_txn %px",
+				  bh, jh,
+				  jh ? jh->b_transaction : NULL,
+				  jh ? jh->b_next_transaction : NULL);
 			goto out;
 		}
 	} while ((bh = bh->b_this_page) != head);
@@ -522,6 +530,7 @@ static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
 		.range_end = dirty_end,
         };
 
+	tdbg_ext4(jinode->i_vfs_inode->i_sb, "entry for inode: %px", jinode->i_vfs_inode);
 	return write_cache_pages(mapping, &wbc,
 				 ext4_journalled_writepage_callback,
 				 transaction);
@@ -529,6 +538,7 @@ static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
 
 static int ext4_journalled_finish_inode_data_buffers(struct jbd2_inode *jinode)
 {
+	tdbg_ext4(jinode->i_vfs_inode->i_sb, "entry for inode: %px", jinode->i_vfs_inode);
 	return 0;
 }
 
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index b98d227b50d8..96f0d81eadf9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -29,6 +29,8 @@
 #include <linux/printk.h>
 #include <linux/delay.h>
 
+#include "../ext4/ext4_jbd2_dbg.h"
+
 static journal_t *force_panic;
 
 /*
@@ -222,11 +224,14 @@ static int journal_submit_data_buffers(journal_t *journal,
 	int err, ret = 0;
 	struct address_space *mapping;
 
+	tdbg_jbd2(journal, "entry for transaction: 0x%px\n", commit_transaction);
 	spin_lock(&journal->j_list_lock);
 	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
 		loff_t dirty_start = jinode->i_dirty_start;
 		loff_t dirty_end = jinode->i_dirty_end;
 
+		tdbg_jbd2(journal, "txn list has inode %px (write data flag: 0x%lx)\n", jinode->i_vfs_inode, (jinode->i_flags & JI_WRITE_DATA));
+
 		if (!(jinode->i_flags & JI_WRITE_DATA))
 			continue;
 		mapping = jinode->i_vfs_inode->i_mapping;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e4944436e733..b86b871ee823 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -48,6 +48,8 @@
 #include <linux/uaccess.h>
 #include <asm/page.h>
 
+#include "../ext4/ext4_jbd2_dbg.h"
+
 #ifdef CONFIG_JBD2_DEBUG
 ushort jbd2_journal_enable_debug __read_mostly;
 EXPORT_SYMBOL(jbd2_journal_enable_debug);
@@ -453,6 +455,9 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 
 	*bh_out = new_bh;
 
+	tdbg_jbd2(transaction->t_journal, "copy out: done/need %d/%d, bh: %px, offset: %04x, page: %px, inode: %px\n",
+	     done_copy_out, need_copy_out, jh2bh(jh_in), new_offset, new_page, new_page->mapping ? new_page->mapping->host : NULL);
+
 	/*
 	 * The to-be-written buffer needs to get moved to the io queue,
 	 * and the original buffer whose contents we are shadowing or
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e91aad3637a2..93a55a228e08 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -30,6 +30,8 @@
 
 #include <trace/events/jbd2.h>
 
+#include "../ext4/ext4_jbd2_dbg.h"
+
 static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
 static void __jbd2_journal_unfile_buffer(struct journal_head *jh);
 
@@ -952,6 +954,8 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 repeat:
 	bh = jh2bh(jh);
 
+	tdbg_jbd2(journal, "entry for bh: %px, offset: %04lx, page %px, inode: %px",
+		  bh, offset_in_page(bh->b_data), bh->b_page, bh->b_page->mapping->host);
 	/* @@@ Need to check for errors here at some point. */
 
  	start_lock = jiffies;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [RFC PATCH v2/SETUP SCRIPT]
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (4 preceding siblings ...)
  2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-10  1:02 ` [RFC PATCH v2/TEST CASE] Mauricio Faria de Oliveira
  2020-08-19  9:27 ` [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  7 siblings, 0 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

#!/bin/sh
set -ex

IMG=$HOME/ext4.img
MNT=/ext4

if [ "$1" != 'mount' ]; then
	rm -f $IMG
	truncate --size 128M $IMG
fi

DEV=$(sudo losetup --find --show $IMG)

if [ "$1" != 'mount' ]; then
	sudo mkfs.ext4 $DEV
	sudo tune2fs -f -o journal_data $DEV
fi

sudo mkdir -p $MNT
sudo mount -o data=journal $DEV $MNT
sudo chown -R ubuntu:ubuntu $MNT

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [RFC PATCH v2/TEST CASE]
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (5 preceding siblings ...)
  2020-08-10  1:02 ` [RFC PATCH v2/SETUP SCRIPT] Mauricio Faria de Oliveira
@ 2020-08-10  1:02 ` Mauricio Faria de Oliveira
  2020-08-19  9:27 ` [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara
  7 siblings, 0 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-10  1:02 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#include <sys/mman.h>

int main() {
	int fd, rc;
	char *addr;
	const int PAGE_SIZE = sysconf(_SC_PAGESIZE);

	rc = unlink("file");
	if (rc < 0 && errno != ENOENT ) {
		perror("unlink");
		return 1;
	}

	fd = open("file", O_RDWR | O_CREAT, S_IRUSR | S_IWUSR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr < 0) {
		perror("mmap");
		return 1;
	}

	rc = pwrite(fd, "a", 1, 0);
	if (rc < 0) {
		perror("pwrite");
		close(fd);
		return 1;
	}

	addr[0] = 'B';
	addr[1] = 'U';
	addr[2] = 'G';

	while (1) {
		printf("Press enter to change buffer contents\n");
		getchar();
		addr[3]++;
		printf("Buffer contents changed\n");
	}

	// This is not reached.
	rc = munmap(addr, PAGE_SIZE);
	if (rc < 0) {
		perror("munmap");
		return 1;
	}

	close(fd);
	return 0;
}


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] ext4/jbd2: debugging messages
  2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
@ 2020-08-10  1:48   ` kernel test robot
  2020-08-10  2:51   ` kernel test robot
  2020-08-19  8:46   ` Jan Kara
  2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2020-08-10  1:48 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2420 bytes --]

Hi Mauricio,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on ext3/for_next linus/master v5.8 next-20200807]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mauricio-Faria-de-Oliveira/ext4-jbd2-data-journal-write-protect-pages-on-transaction-commit/20200810-090421
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: arm-bcm2835_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/ext4/inode.c:51:10: fatal error: ext4_jbd2_dbg.h: No such file or directory
      51 | #include "ext4_jbd2_dbg.h"
         |          ^~~~~~~~~~~~~~~~~
   compilation terminated.
--
>> fs/ext4/super.c:61:10: fatal error: ext4_jbd2_dbg.h: No such file or directory
      61 | #include "ext4_jbd2_dbg.h"
         |          ^~~~~~~~~~~~~~~~~
   compilation terminated.
--
>> fs/jbd2/transaction.c:33:10: fatal error: ../ext4/ext4_jbd2_dbg.h: No such file or directory
      33 | #include "../ext4/ext4_jbd2_dbg.h"
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.
--
>> fs/jbd2/commit.c:32:10: fatal error: ../ext4/ext4_jbd2_dbg.h: No such file or directory
      32 | #include "../ext4/ext4_jbd2_dbg.h"
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.
--
>> fs/jbd2/journal.c:51:10: fatal error: ../ext4/ext4_jbd2_dbg.h: No such file or directory
      51 | #include "../ext4/ext4_jbd2_dbg.h"
         |          ^~~~~~~~~~~~~~~~~~~~~~~~~
   compilation terminated.

vim +51 fs/ext4/inode.c

    50	
  > 51	#include "ext4_jbd2_dbg.h"
    52	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 27568 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] ext4/jbd2: debugging messages
  2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
  2020-08-10  1:48   ` kernel test robot
@ 2020-08-10  2:51   ` kernel test robot
  2020-08-19  8:46   ` Jan Kara
  2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2020-08-10  2:51 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 2457 bytes --]

Hi Mauricio,

[FYI, it's a private test report for your RFC patch.]
[auto build test ERROR on ext4/dev]
[also build test ERROR on ext3/for_next linus/master v5.8 next-20200807]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mauricio-Faria-de-Oliveira/ext4-jbd2-data-journal-write-protect-pages-on-transaction-commit/20200810-090421
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-a001-20200810 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 3a34228bff6fdf45b50cb57cf5fb85d659eeb672)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> fs/ext4/inode.c:51:10: fatal error: 'ext4_jbd2_dbg.h' file not found
   #include "ext4_jbd2_dbg.h"
            ^~~~~~~~~~~~~~~~~
   1 error generated.
--
>> fs/ext4/super.c:61:10: fatal error: 'ext4_jbd2_dbg.h' file not found
   #include "ext4_jbd2_dbg.h"
            ^~~~~~~~~~~~~~~~~
   1 error generated.
--
>> fs/jbd2/transaction.c:33:10: fatal error: '../ext4/ext4_jbd2_dbg.h' file not found
   #include "../ext4/ext4_jbd2_dbg.h"
            ^~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.
--
>> fs/jbd2/commit.c:32:10: fatal error: '../ext4/ext4_jbd2_dbg.h' file not found
   #include "../ext4/ext4_jbd2_dbg.h"
            ^~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.
--
>> fs/jbd2/journal.c:51:10: fatal error: '../ext4/ext4_jbd2_dbg.h' file not found
   #include "../ext4/ext4_jbd2_dbg.h"
            ^~~~~~~~~~~~~~~~~~~~~~~~~
   1 error generated.

vim +51 fs/ext4/inode.c

    50	
  > 51	#include "ext4_jbd2_dbg.h"
    52	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 32209 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption
  2020-08-10  1:02 ` [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption Mauricio Faria de Oliveira
@ 2020-08-18 14:38   ` Jan Kara
  2020-08-19  1:15     ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2020-08-18 14:38 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Sun 09-08-20 22:02:04, Mauricio Faria de Oliveira wrote:
> This checks during journal commit, right after calculating the
> checksum of a buffer head, whether its contents match the 'BUG'
> string (the cookie string in the test case userspace part.)
> 
> If so, it sleeps 5 seconds for such contents to change (i.e.,
> so that the actual checksum changes from what was calculated.)
> 
> And if it changed, set a flag to panic after committing to disk.
> 
> Then, on filesystem remount/journal recovery there is an invalid
> checksum error, and recovery fails:
> 
>   $ sudo mount -o data=journal,journal_checksum $DEV $MNT
>   [ 100.832223] EXT4-fs: Warning: mounting with data=journal disables
>   delayed allocation, dioread_nolock, and O_DIRECT support!
>   [ 100.837488] JBD2: Invalid checksum recovering data block 8706 in log
>   [ 100.842010] JBD2: recovery failed
>   [ 100.843045] EXT4-fs (loop0): error loading journal
>   mount: /ext4: can't read superblock on /dev/loop0.

Nice to have this for testing but when you'll do some "official"
submission, just send this patch separately so that it's clear shouldn't be
included in the kernel...

								Honza

> ---
>  fs/jbd2/commit.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 6d2da8ad0e6f..51f713089e35 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -26,6 +26,11 @@
>  #include <linux/bitops.h>
>  #include <trace/events/jbd2.h>
>  
> +#include <linux/printk.h>
> +#include <linux/delay.h>
> +
> +static journal_t *force_panic;
> +
>  /*
>   * IO end handler for temporary buffer_heads handling writes to the journal.
>   */
> @@ -331,14 +336,35 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
>  	__u32 csum32;
>  	__be32 seq;
>  
> +	// For the testcase
> +	__u32 csum32_later;
> +	__u8 *bh_data;
> +
>  	if (!jbd2_journal_has_csum_v2or3(j))
>  		return;
>  
>  	seq = cpu_to_be32(sequence);
>  	addr = kmap_atomic(page);
>  	csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
> +	csum32_later = csum32; // Copy csum32 to check again later
>  	csum32 = jbd2_chksum(j, csum32, addr + offset_in_page(bh->b_data),
>  			     bh->b_size);
> +
> +	// Check for testcase cookie 'BUG' in the buffer_head data.
> +	bh_data = addr + offset_in_page(bh->b_data);
> +	if (bh_data[0] == 'B' &&
> +	    bh_data[1] == 'U' &&
> +	    bh_data[2] == 'G') {
> +		pr_info("TESTCASE: Cookie found. Waiting 5 seconds for changes.\n");
> +		msleep(5000);
> +		pr_info("TESTCASE: Cookie eaten. Resumed.\n");
> +	}
> +
> +	// Check the checksum again for changes/panic after commit.
> +	csum32_later = jbd2_chksum(j, csum32_later, addr + offset_in_page(bh->b_data), bh->b_size);
> +	if (csum32 != csum32_later)
> +		force_panic = j;
> +
>  	kunmap_atomic(addr);
>  
>  	if (jbd2_has_feature_csum3(j))
> @@ -885,6 +911,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  		blkdev_issue_flush(journal->j_dev, GFP_NOFS);
>  	}
>  
> +	if (force_panic == journal)
> +		panic("TESTCASE: checksum changed; commit record done; panic!\n");
> +
>  	if (err)
>  		jbd2_journal_abort(journal, err);
>  
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
  2020-08-10  1:02 ` [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers Mauricio Faria de Oliveira
@ 2020-08-18 14:52   ` Jan Kara
  2020-08-19  1:20     ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2020-08-18 14:52 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Sun 09-08-20 22:02:05, Mauricio Faria de Oliveira wrote:
> Add the callbacks as opt-in to override the default behavior for
> the transaction's inode list, instead of moving that code around.
> 
> This is important as not only ext4 uses the inode list: ocfs2 too,
> via jbd2_journal_inode_ranged_write(), and maybe out-of-tree code.

I'd prefer if the callback is called unconditionally, jbd2 exports the
callback that implements the current behavior and and both ext4 & ocfs2
are adapted to use this callback. We don't care about out of tree code.
That way things are cleaner long term...

> To opt-out of the default behavior (i.e., to do nothing), one has
> to opt-in with a no-op function.

Your Signed-off-by is missing for this patch.

> ---
>  fs/jbd2/commit.c     | 21 ++++++++++++++++-----
>  include/linux/jbd2.h | 21 ++++++++++++++++++++-
>  2 files changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 51f713089e35..b98d227b50d8 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -237,10 +237,14 @@ static int journal_submit_data_buffers(journal_t *journal,
>  		 * instead of writepages. Because writepages can do
>  		 * block allocation  with delalloc. We need to write
>  		 * only allocated blocks here.
> +		 * This can be overriden with a custom callback.
>  		 */
>  		trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> -		err = journal_submit_inode_data_buffers(mapping, dirty_start,
> -				dirty_end);
> +		if (journal->j_submit_inode_data_buffers)
> +			err = journal->j_submit_inode_data_buffers(jinode);
> +		else
> +			err = journal_submit_inode_data_buffers(mapping,
> +					dirty_start, dirty_end);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> @@ -274,9 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
>  			continue;
>  		jinode->i_flags |= JI_COMMIT_RUNNING;
>  		spin_unlock(&journal->j_list_lock);
> -		err = filemap_fdatawait_range_keep_errors(
> -				jinode->i_vfs_inode->i_mapping, dirty_start,
> -				dirty_end);
> +		/*
> +		 * Wait for the inode data buffers writeout.
> +		 * This can be overriden with a custom callback.
> +		 */
> +		if (journal->j_finish_inode_data_buffers)
> +			err = journal->j_finish_inode_data_buffers(jinode);
> +		else
> +			err = filemap_fdatawait_range_keep_errors(
> +					jinode->i_vfs_inode->i_mapping,
> +					dirty_start, dirty_end);
>  		if (!ret)
>  			ret = err;
>  		spin_lock(&journal->j_list_lock);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index d56128df2aff..24efe88eda1b 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -628,7 +628,8 @@ struct transaction_s
>  	struct journal_head	*t_shadow_list;
>  
>  	/*
> -	 * List of inodes whose data we've modified in data=ordered mode.
> +	 * List of inodes whose data we've modified in data=ordered mode
> +	 * or whose pages we should write-protect in data=journaled mode.

I'd rather change the comment to generic "List of inodes associated with
the transaction. E.g. ext4 uses this to track inodes in data=ordered and
data=journal mode that need special handling on transaction commit.".

>  	 * [j_list_lock]
>  	 */
>  	struct list_head	t_inode_list;
> @@ -1110,6 +1111,24 @@ struct journal_s
>  	void			(*j_commit_callback)(journal_t *,
>  						     transaction_t *);
>  
> +	/**
> +	 * @j_submit_inode_data_buffers:
> +	 *
> +	 * This function is called before flushing metadata buffers.
> +	 * This overrides the default behavior (writeout data buffers.)
> +	 */

I'd change the comment to:
	 * This function is called for all inodes associated with the
	 * committing transaction marked with JI_WRITE_DATA flag before we
	 * start to write out the transaction to the journal.

> +	int			(*j_submit_inode_data_buffers)
> +					(struct jbd2_inode *);
> +
> +	/**
> +	 * @j_finish_inode_data_buffers:
> +	 *
> +	 * This function is called after flushing metadata buffers.
> +	 * This overrides the default behavior (wait writeout.)
> +	 */

And here:
	 * This function is called for all inodes associated with the
	 * committing transaction marked with JI_WAIT_DATA flag after we
	 * we have written the transaction to the journal but before we
	 * write out the commit block.


> +	int			(*j_finish_inode_data_buffers)
> +					(struct jbd2_inode *);
> +
>  	/*
>  	 * Journal statistics
>  	 */

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption
  2020-08-18 14:38   ` Jan Kara
@ 2020-08-19  1:15     ` Mauricio Faria de Oliveira
  0 siblings, 0 replies; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-19  1:15 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Tue, Aug 18, 2020 at 11:38 AM Jan Kara <jack@suse.cz> wrote:
>
> On Sun 09-08-20 22:02:04, Mauricio Faria de Oliveira wrote:
> > This checks during journal commit, right after calculating the
> > checksum of a buffer head, whether its contents match the 'BUG'
> > string (the cookie string in the test case userspace part.)
> >
> > If so, it sleeps 5 seconds for such contents to change (i.e.,
> > so that the actual checksum changes from what was calculated.)
> >
> > And if it changed, set a flag to panic after committing to disk.
> >
> > Then, on filesystem remount/journal recovery there is an invalid
> > checksum error, and recovery fails:
> >
> >   $ sudo mount -o data=journal,journal_checksum $DEV $MNT
> >   [ 100.832223] EXT4-fs: Warning: mounting with data=journal disables
> >   delayed allocation, dioread_nolock, and O_DIRECT support!
> >   [ 100.837488] JBD2: Invalid checksum recovering data block 8706 in log
> >   [ 100.842010] JBD2: recovery failed
> >   [ 100.843045] EXT4-fs (loop0): error loading journal
> >   mount: /ext4: can't read superblock on /dev/loop0.
>
> Nice to have this for testing but when you'll do some "official"
> submission, just send this patch separately so that it's clear shouldn't be
> included in the kernel...
>

Yup, absolutely. :) I forgot to make its subject line different as in
other test-case parts.

>                                                                 Honza
>
> > ---
> >  fs/jbd2/commit.c | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 6d2da8ad0e6f..51f713089e35 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -26,6 +26,11 @@
> >  #include <linux/bitops.h>
> >  #include <trace/events/jbd2.h>
> >
> > +#include <linux/printk.h>
> > +#include <linux/delay.h>
> > +
> > +static journal_t *force_panic;
> > +
> >  /*
> >   * IO end handler for temporary buffer_heads handling writes to the journal.
> >   */
> > @@ -331,14 +336,35 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
> >       __u32 csum32;
> >       __be32 seq;
> >
> > +     // For the testcase
> > +     __u32 csum32_later;
> > +     __u8 *bh_data;
> > +
> >       if (!jbd2_journal_has_csum_v2or3(j))
> >               return;
> >
> >       seq = cpu_to_be32(sequence);
> >       addr = kmap_atomic(page);
> >       csum32 = jbd2_chksum(j, j->j_csum_seed, (__u8 *)&seq, sizeof(seq));
> > +     csum32_later = csum32; // Copy csum32 to check again later
> >       csum32 = jbd2_chksum(j, csum32, addr + offset_in_page(bh->b_data),
> >                            bh->b_size);
> > +
> > +     // Check for testcase cookie 'BUG' in the buffer_head data.
> > +     bh_data = addr + offset_in_page(bh->b_data);
> > +     if (bh_data[0] == 'B' &&
> > +         bh_data[1] == 'U' &&
> > +         bh_data[2] == 'G') {
> > +             pr_info("TESTCASE: Cookie found. Waiting 5 seconds for changes.\n");
> > +             msleep(5000);
> > +             pr_info("TESTCASE: Cookie eaten. Resumed.\n");
> > +     }
> > +
> > +     // Check the checksum again for changes/panic after commit.
> > +     csum32_later = jbd2_chksum(j, csum32_later, addr + offset_in_page(bh->b_data), bh->b_size);
> > +     if (csum32 != csum32_later)
> > +             force_panic = j;
> > +
> >       kunmap_atomic(addr);
> >
> >       if (jbd2_has_feature_csum3(j))
> > @@ -885,6 +911,9 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >               blkdev_issue_flush(journal->j_dev, GFP_NOFS);
> >       }
> >
> > +     if (force_panic == journal)
> > +             panic("TESTCASE: checksum changed; commit record done; panic!\n");
> > +
> >       if (err)
> >               jbd2_journal_abort(journal, err);
> >
> > --
> > 2.17.1
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



-- 
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
  2020-08-18 14:52   ` Jan Kara
@ 2020-08-19  1:20     ` Mauricio Faria de Oliveira
  2020-08-19  9:16       ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-19  1:20 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

Hi Jan,

Thanks for reviewing.

On Tue, Aug 18, 2020 at 11:52 AM Jan Kara <jack@suse.cz> wrote:
>
> On Sun 09-08-20 22:02:05, Mauricio Faria de Oliveira wrote:
> > Add the callbacks as opt-in to override the default behavior for
> > the transaction's inode list, instead of moving that code around.
> >
> > This is important as not only ext4 uses the inode list: ocfs2 too,
> > via jbd2_journal_inode_ranged_write(), and maybe out-of-tree code.
>
> I'd prefer if the callback is called unconditionally, jbd2 exports the
> callback that implements the current behavior and and both ext4 & ocfs2
> are adapted to use this callback. We don't care about out of tree code.
> That way things are cleaner long term...

Understood.

>
> > To opt-out of the default behavior (i.e., to do nothing), one has
> > to opt-in with a no-op function.
>
> Your Signed-off-by is missing for this patch.

Oh, I thought it wasn't needed in RFC patches.

Thanks for the suggestions below; they're more precise and descriptive.

I had a few questions in the cover letter, but in case you didn't have
the time, I'll just try harder on them; no worries.

Kind regards,
Mauricio

>
> > ---
> >  fs/jbd2/commit.c     | 21 ++++++++++++++++-----
> >  include/linux/jbd2.h | 21 ++++++++++++++++++++-
> >  2 files changed, 36 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 51f713089e35..b98d227b50d8 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -237,10 +237,14 @@ static int journal_submit_data_buffers(journal_t *journal,
> >                * instead of writepages. Because writepages can do
> >                * block allocation  with delalloc. We need to write
> >                * only allocated blocks here.
> > +              * This can be overriden with a custom callback.
> >                */
> >               trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> > -             err = journal_submit_inode_data_buffers(mapping, dirty_start,
> > -                             dirty_end);
> > +             if (journal->j_submit_inode_data_buffers)
> > +                     err = journal->j_submit_inode_data_buffers(jinode);
> > +             else
> > +                     err = journal_submit_inode_data_buffers(mapping,
> > +                                     dirty_start, dirty_end);
> >               if (!ret)
> >                       ret = err;
> >               spin_lock(&journal->j_list_lock);
> > @@ -274,9 +278,16 @@ static int journal_finish_inode_data_buffers(journal_t *journal,
> >                       continue;
> >               jinode->i_flags |= JI_COMMIT_RUNNING;
> >               spin_unlock(&journal->j_list_lock);
> > -             err = filemap_fdatawait_range_keep_errors(
> > -                             jinode->i_vfs_inode->i_mapping, dirty_start,
> > -                             dirty_end);
> > +             /*
> > +              * Wait for the inode data buffers writeout.
> > +              * This can be overriden with a custom callback.
> > +              */
> > +             if (journal->j_finish_inode_data_buffers)
> > +                     err = journal->j_finish_inode_data_buffers(jinode);
> > +             else
> > +                     err = filemap_fdatawait_range_keep_errors(
> > +                                     jinode->i_vfs_inode->i_mapping,
> > +                                     dirty_start, dirty_end);
> >               if (!ret)
> >                       ret = err;
> >               spin_lock(&journal->j_list_lock);
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index d56128df2aff..24efe88eda1b 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -628,7 +628,8 @@ struct transaction_s
> >       struct journal_head     *t_shadow_list;
> >
> >       /*
> > -      * List of inodes whose data we've modified in data=ordered mode.
> > +      * List of inodes whose data we've modified in data=ordered mode
> > +      * or whose pages we should write-protect in data=journaled mode.
>
> I'd rather change the comment to generic "List of inodes associated with
> the transaction. E.g. ext4 uses this to track inodes in data=ordered and
> data=journal mode that need special handling on transaction commit.".
>
> >        * [j_list_lock]
> >        */
> >       struct list_head        t_inode_list;
> > @@ -1110,6 +1111,24 @@ struct journal_s
> >       void                    (*j_commit_callback)(journal_t *,
> >                                                    transaction_t *);
> >
> > +     /**
> > +      * @j_submit_inode_data_buffers:
> > +      *
> > +      * This function is called before flushing metadata buffers.
> > +      * This overrides the default behavior (writeout data buffers.)
> > +      */
>
> I'd change the comment to:
>          * This function is called for all inodes associated with the
>          * committing transaction marked with JI_WRITE_DATA flag before we
>          * start to write out the transaction to the journal.
>
> > +     int                     (*j_submit_inode_data_buffers)
> > +                                     (struct jbd2_inode *);
> > +
> > +     /**
> > +      * @j_finish_inode_data_buffers:
> > +      *
> > +      * This function is called after flushing metadata buffers.
> > +      * This overrides the default behavior (wait writeout.)
> > +      */
>
> And here:
>          * This function is called for all inodes associated with the
>          * committing transaction marked with JI_WAIT_DATA flag after we
>          * we have written the transaction to the journal but before we
>          * write out the commit block.
>
>
> > +     int                     (*j_finish_inode_data_buffers)
> > +                                     (struct jbd2_inode *);
> > +
> >       /*
> >        * Journal statistics
> >        */
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



-- 
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback
  2020-08-10  1:02 ` [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback Mauricio Faria de Oliveira
@ 2020-08-19  8:44   ` Jan Kara
  2020-08-19 10:41     ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2020-08-19  8:44 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Sun 09-08-20 22:02:06, Mauricio Faria de Oliveira wrote:
> This implements the journal's j_submit_inode_data_buffers() callback
> to write-protect the inode's pages with write_cache_pages(), and use
> a writepage callback to redirty pages with buffers that are not part
> of the committing transaction or the next transaction.
> 
> And set a no-op function as j_finish_inode_data_buffers() callback
> (nothing needed other than the write-protect above.)
> 
> Currently, the inode is added to the transaction's inode list in the
> __ext4_journalled_writepage() function.
> ---
>  fs/ext4/inode.c |  4 +++
>  fs/ext4/super.c | 65 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 10dd470876b3..978ccde8454f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1911,6 +1911,10 @@ static int __ext4_journalled_writepage(struct page *page,
>  		err = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
>  					     write_end_fn);
>  	}
> +	if (ret == 0)
> +		ret = err;
> +	// XXX: is this correct for inline data inodes?

Inodes with inline data should never get here. The thing is that when a
file with inline data is faulted for writing, ext4_page_mkwrite() converts
the file to normal data format. And ordinary write(2) will directly update
the inline data and clear the page dirty bit so writepage isn't called for
it.

> +	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
>  	if (ret == 0)
>  		ret = err;
>  	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 330957ed1f05..38aaac6572ea 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -472,6 +472,66 @@ static void ext4_journal_commit_callback(journal_t *journal, transaction_t *txn)
>  	spin_unlock(&sbi->s_md_lock);
>  }
>  
> +/*
> + * This writepage callback for write_cache_pages()
> + * takes care of a few cases after page cleaning.
> + *
> + * write_cache_pages() already checks for dirty pages
> + * and calls clear_page_dirty_for_io(), which we want,
> + * to write protect the pages.
> + *
> + * However, we have to redirty a page in two cases:
> + * 1) some buffer is not part of the committing transaction
> + * 2) some buffer already has b_next_transaction set
> + */
> +
> +static int ext4_journalled_writepage_callback(struct page *page,
> +					      struct writeback_control *wbc,
> +					      void *data)
> +{
> +	transaction_t *transaction = (transaction_t *) data;
> +	struct buffer_head *bh, *head;
> +	struct journal_head *jh;
> +
> +	// XXX: any chance of !bh here?

No. In ext4, whenever a page is dirty, it should have buffers attached.

> +	bh = head = page_buffers(page);
> +	do {
> +		jh = bh2jh(bh);
> +		if (!jh || jh->b_transaction != transaction ||
> +		    jh->b_next_transaction) {
> +			redirty_page_for_writepage(wbc, page);
> +			goto out;
> +		}
> +	} while ((bh = bh->b_this_page) != head);
> +
> +out:
> +	return AOP_WRITEPAGE_ACTIVATE;
> +}
> +
> +static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	struct address_space *mapping = jinode->i_vfs_inode->i_mapping;
> +	transaction_t *transaction = jinode->i_transaction;
> +	loff_t dirty_start = jinode->i_dirty_start;
> +	loff_t dirty_end = jinode->i_dirty_end;
> +
> +	struct writeback_control wbc = {
> +		.sync_mode =  WB_SYNC_ALL,
> +		.nr_to_write = mapping->nrpages * 2,

For WB_SYNC_ALL writeback, .nr_to_write is mostly ignored so just set it to
~0ULL.

> +		.range_start = dirty_start,
> +		.range_end = dirty_end,
> +        };
> +
> +	return write_cache_pages(mapping, &wbc,
> +				 ext4_journalled_writepage_callback,
> +				 transaction);

I was thinking about this and we may need to do this somewhat differently.
I've realized that there's the slight trouble that we now use page dirty
bit for two purposes in data=journal mode - to track pages that need write
protection during commit and also to track pages which have buffers that
need checkpointing. And this mixing is making things complex. So I was
thinking that we could simply leave PageDirty bit for checkpointing
purposes and always make sure buffers are appropriately attached to a
transaction as dirty in ext4_page_mkwrite(). This will make mmap writes in
data=journal mode somewhat less efficient (all the pages written through
mmap while transaction T was running will be written to the journal during
transaction T commit while currently, we write only pages that also went
through __ext4_journalled_writepage() while T was running which usually
happens less frequently). But the code should be simpler and we don't care
about mmap write performance for data=journal mode much. Furthermore I
don't think that the tricks with PageChecked logic we play in data=journal
mode are really needed as well which should bring further simplifications.
I'll try to code this cleanup.

Then in ext4_journalled_submit_inode_data_buffers() we would need to walk
all the pages in the range describe by jinode and call page_mkclean() for
each page which has buffer attached to the committing transaction.

> +}
> +
> +static int ext4_journalled_finish_inode_data_buffers(struct jbd2_inode *jinode)
> +{
> +	return 0;
> +}
> +
>  static bool system_going_down(void)
>  {
>  	return system_state == SYSTEM_HALT || system_state == SYSTEM_POWER_OFF
> @@ -4599,6 +4659,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  		ext4_msg(sb, KERN_ERR, "can't mount with "
>  			"journal_async_commit in data=ordered mode");
>  		goto failed_mount_wq;
> +	} else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA) {
> +		sbi->s_journal->j_submit_inode_data_buffers =
> +			ext4_journalled_submit_inode_data_buffers;
> +		sbi->s_journal->j_finish_inode_data_buffers =
> +			ext4_journalled_finish_inode_data_buffers;

We can journal data only for individual inodes (when inode has journal_data
flag set). So you have to set the callback unconditionally here and only in
the callback decide what to do with the particular inode based on what
ext4_should_journal_data() tells about the inode.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 5/5] ext4/jbd2: debugging messages
  2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
  2020-08-10  1:48   ` kernel test robot
  2020-08-10  2:51   ` kernel test robot
@ 2020-08-19  8:46   ` Jan Kara
  2 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2020-08-19  8:46 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Sun 09-08-20 22:02:08, Mauricio Faria de Oliveira wrote:
> For code tracking and deubgging purposes; some used in cover letter.

Again, this is nice for debugging but for official submission, keep this
patch separate because it should not get merged...

								Honza
> 
> ---
>  fs/ext4/inode.c       | 27 +++++++++++++++++++++++++++
>  fs/ext4/super.c       | 10 ++++++++++
>  fs/jbd2/commit.c      |  5 +++++
>  fs/jbd2/journal.c     |  5 +++++
>  fs/jbd2/transaction.c |  4 ++++
>  5 files changed, 51 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ce5464f92a7e..cd01aec87303 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -48,6 +48,17 @@
>  
>  #include <trace/events/ext4.h>
>  
> +#include "ext4_jbd2_dbg.h"
> +
> +static int ext4_bh_tdbg(handle_t *handle, struct buffer_head *bh)
> +{
> +	struct super_block *sb = bh->b_page->mapping->host->i_sb;
> +	tdbg_ext4(sb, "bh: %px, data offset: %04llx, page: %px, inode: %px\n",
> +		  bh, (u64) bh->b_data & ((u64)PAGE_SIZE - 1),
> +		  bh->b_page, bh->b_page->mapping->host);
> +	return 0;
> +}
> +
>  static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
>  			      struct ext4_inode_info *ei)
>  {
> @@ -1193,6 +1204,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  		ret = __block_write_begin(page, pos, len, ext4_get_block);
>  #endif
>  	if (!ret && ext4_should_journal_data(inode)) {
> +		tdbg_ext4(inode->i_sb, "journal started: inode %px, txn %px", inode, handle->h_transaction);
>  		ret = ext4_walk_page_buffers(handle, page_buffers(page),
>  					     from, to, NULL,
>  					     do_journal_get_write_access);
> @@ -1446,6 +1458,7 @@ static int ext4_journalled_write_end(struct file *file,
>  			ext4_orphan_del(NULL, inode);
>  	}
>  
> +	tdbg_ext4(inode->i_sb, "journal stopped: inode %px", inode);
>  	return ret ? ret : copied;
>  }
>  
> @@ -1917,6 +1930,10 @@ static int __ext4_journalled_writepage(struct page *page,
>  	err = ext4_jbd2_inode_add_write(handle, inode, 0, len);
>  	if (ret == 0)
>  		ret = err;
> +	{
> +		struct super_block *sb = handle->h_transaction->t_journal->j_private;
> +		tdbg_ext4(sb, "Added inode to txn list: inode %px, txn = %px, err = %d", inode, handle->h_transaction, err);
> +	}
>  	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
>  	err = ext4_journal_stop(handle);
>  	if (!ret)
> @@ -2035,6 +2052,7 @@ static int ext4_writepage(struct page *page,
>  		keep_towrite = true;
>  	}
>  
> +	tdbg_ext4(inode->i_sb, "called for inode %px by comm %s", inode, current->comm);
>  	if (PageChecked(page) && ext4_should_journal_data(inode))
>  		/*
>  		 * It's mmapped pagecache.  Add buffers and journal it.  There
> @@ -5969,6 +5987,8 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	get_block_t *get_block;
>  	int retries = 0;
>  
> +	tdbg_ext4(inode->i_sb, "entry for inode %px", inode);
> +
>  	if (unlikely(IS_IMMUTABLE(inode)))
>  		return VM_FAULT_SIGBUS;
>  
> @@ -6006,6 +6026,9 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  		len = size & ~PAGE_MASK;
>  	else
>  		len = PAGE_SIZE;
> +
> +	ext4_walk_page_buffers(NULL, page_buffers(page), 0, len, NULL, ext4_bh_tdbg);
> +
>  	/*
>  	 * Return if we have all the buffers mapped. This avoids the need to do
>  	 * journal_start/journal_stop which can block and take a long time. But
> @@ -6018,6 +6041,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  			/* Wait so that we don't change page under IO */
>  			wait_for_stable_page(page);
>  			ret = VM_FAULT_LOCKED;
> +			tdbg_ext4(inode->i_sb, "returning; all buffers mapped for inode %px", inode);
>  			goto out;
>  		}
>  	}
> @@ -6036,6 +6060,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  	}
>  	err = block_page_mkwrite(vma, vmf, get_block);
>  	if (!err && ext4_should_journal_data(inode)) {
> +		tdbg_ext4(inode->i_sb, "before djgwa(), for inode %px", inode);
>  		if (ext4_walk_page_buffers(handle, page_buffers(page), 0,
>  			  PAGE_SIZE, NULL, do_journal_get_write_access)) {
>  			unlock_page(page);
> @@ -6043,6 +6068,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  			ext4_journal_stop(handle);
>  			goto out;
>  		}
> +		tdbg_ext4(inode->i_sb, "after  djgwa(), for inode %px", inode);
>  		ext4_set_inode_state(inode, EXT4_STATE_JDATA);
>  		if (ext4_jbd2_inode_add_write(handle, inode, 0, PAGE_SIZE)) {
>  			unlock_page(page);
> @@ -6050,6 +6076,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
>  			ext4_journal_stop(handle);
>  			goto out;
>  		}
> +		tdbg_ext4(inode->i_sb, "Added inode to txn list: inode %px, txn = %px, err = 0", inode, handle->h_transaction);
>  	}
>  	ext4_journal_stop(handle);
>  	if (err == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 38aaac6572ea..7167fcf60b5c 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -58,6 +58,8 @@
>  #define CREATE_TRACE_POINTS
>  #include <trace/events/ext4.h>
>  
> +#include "ext4_jbd2_dbg.h"
> +
>  static struct ext4_lazy_init *ext4_li_info;
>  static struct mutex ext4_li_mtx;
>  static struct ratelimit_state ext4_mount_msg_ratelimit;
> @@ -492,14 +494,20 @@ static int ext4_journalled_writepage_callback(struct page *page,
>  	transaction_t *transaction = (transaction_t *) data;
>  	struct buffer_head *bh, *head;
>  	struct journal_head *jh;
> +	struct super_block *sb = page->mapping->host->i_sb;
>  
>  	// XXX: any chance of !bh here?
>  	bh = head = page_buffers(page);
> +	tdbg_ext4(sb, "entry for bh %px, page %px, inode: %px", bh, page, page->mapping->host);
>  	do {
>  		jh = bh2jh(bh);
>  		if (!jh || jh->b_transaction != transaction ||
>  		    jh->b_next_transaction) {
>  			redirty_page_for_writepage(wbc, page);
> +			tdbg_ext4(sb, "redirty for bh %px, jh, %px, txn %px, next_txn %px",
> +				  bh, jh,
> +				  jh ? jh->b_transaction : NULL,
> +				  jh ? jh->b_next_transaction : NULL);
>  			goto out;
>  		}
>  	} while ((bh = bh->b_this_page) != head);
> @@ -522,6 +530,7 @@ static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  		.range_end = dirty_end,
>          };
>  
> +	tdbg_ext4(jinode->i_vfs_inode->i_sb, "entry for inode: %px", jinode->i_vfs_inode);
>  	return write_cache_pages(mapping, &wbc,
>  				 ext4_journalled_writepage_callback,
>  				 transaction);
> @@ -529,6 +538,7 @@ static int ext4_journalled_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  
>  static int ext4_journalled_finish_inode_data_buffers(struct jbd2_inode *jinode)
>  {
> +	tdbg_ext4(jinode->i_vfs_inode->i_sb, "entry for inode: %px", jinode->i_vfs_inode);
>  	return 0;
>  }
>  
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index b98d227b50d8..96f0d81eadf9 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -29,6 +29,8 @@
>  #include <linux/printk.h>
>  #include <linux/delay.h>
>  
> +#include "../ext4/ext4_jbd2_dbg.h"
> +
>  static journal_t *force_panic;
>  
>  /*
> @@ -222,11 +224,14 @@ static int journal_submit_data_buffers(journal_t *journal,
>  	int err, ret = 0;
>  	struct address_space *mapping;
>  
> +	tdbg_jbd2(journal, "entry for transaction: 0x%px\n", commit_transaction);
>  	spin_lock(&journal->j_list_lock);
>  	list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
>  		loff_t dirty_start = jinode->i_dirty_start;
>  		loff_t dirty_end = jinode->i_dirty_end;
>  
> +		tdbg_jbd2(journal, "txn list has inode %px (write data flag: 0x%lx)\n", jinode->i_vfs_inode, (jinode->i_flags & JI_WRITE_DATA));
> +
>  		if (!(jinode->i_flags & JI_WRITE_DATA))
>  			continue;
>  		mapping = jinode->i_vfs_inode->i_mapping;
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e4944436e733..b86b871ee823 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -48,6 +48,8 @@
>  #include <linux/uaccess.h>
>  #include <asm/page.h>
>  
> +#include "../ext4/ext4_jbd2_dbg.h"
> +
>  #ifdef CONFIG_JBD2_DEBUG
>  ushort jbd2_journal_enable_debug __read_mostly;
>  EXPORT_SYMBOL(jbd2_journal_enable_debug);
> @@ -453,6 +455,9 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  
>  	*bh_out = new_bh;
>  
> +	tdbg_jbd2(transaction->t_journal, "copy out: done/need %d/%d, bh: %px, offset: %04x, page: %px, inode: %px\n",
> +	     done_copy_out, need_copy_out, jh2bh(jh_in), new_offset, new_page, new_page->mapping ? new_page->mapping->host : NULL);
> +
>  	/*
>  	 * The to-be-written buffer needs to get moved to the io queue,
>  	 * and the original buffer whose contents we are shadowing or
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index e91aad3637a2..93a55a228e08 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -30,6 +30,8 @@
>  
>  #include <trace/events/jbd2.h>
>  
> +#include "../ext4/ext4_jbd2_dbg.h"
> +
>  static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
>  static void __jbd2_journal_unfile_buffer(struct journal_head *jh);
>  
> @@ -952,6 +954,8 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  repeat:
>  	bh = jh2bh(jh);
>  
> +	tdbg_jbd2(journal, "entry for bh: %px, offset: %04lx, page %px, inode: %px",
> +		  bh, offset_in_page(bh->b_data), bh->b_page, bh->b_page->mapping->host);
>  	/* @@@ Need to check for errors here at some point. */
>  
>   	start_lock = jiffies;
> -- 
> 2.17.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers
  2020-08-19  1:20     ` Mauricio Faria de Oliveira
@ 2020-08-19  9:16       ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2020-08-19  9:16 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Tue 18-08-20 22:20:08, Mauricio Faria de Oliveira wrote:
> > > To opt-out of the default behavior (i.e., to do nothing), one has
> > > to opt-in with a no-op function.
> >
> > Your Signed-off-by is missing for this patch.
> 
> Oh, I thought it wasn't needed in RFC patches.

Yes, it's not strictly needed if you don't want patches included yet. But
usually they are present so that people have less things to worry about
when preparing final submission :)

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit
  2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
                   ` (6 preceding siblings ...)
  2020-08-10  1:02 ` [RFC PATCH v2/TEST CASE] Mauricio Faria de Oliveira
@ 2020-08-19  9:27 ` Jan Kara
  7 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2020-08-19  9:27 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

Hi Mauricio!

On Sun 09-08-20 22:02:03, Mauricio Faria de Oliveira wrote:
> I'd like to ask for your comments on the patchset so far, and have a few
> questions.

Thanks for taking time to write these patches! I have commented on the
patches and now let's address the remaining questions :)

> Third, my changes/observations/questions:
> 
> 1) > Probably the best place to add inode to this list is
> ext4_journalled_writepage().
> 
> I think we also need it in ext4_page_mkwrite(), right?  See the test
> case, for example:
> 
>     fd = open("file");
>     addr = mmap(fd);
>     pwrite(fd, "a", 1, 0);
>     addr[0] = 'a';
>     press_enter(); // when asked for.
>     addr[0] = 'b';
> 
> Here pwrite() led to ext4_write_begin() which added the inode to a
> transaction (not to the transaction's inode list).

Hum, right. Good spotting, I didn't think of this possibility. Actually,
when we have to do this in ext4_page_mkwrite() we don't need
ext4_journalled_writepage() at all which is what I've suggested when
commenting on your patch 3/5 anyway to simplify things :).

> 2) Do not return early in ext4_page_mkwrite() 'if we have all the buffers mapped'?
> 
> There's this check in ext4_page_mkwrite():
> 
>         /*
>          * Return if we have all the buffers mapped. This avoids the need to do
>          * journal_start/journal_stop which can block and take a long time
>          */
> 
> Corresponding to this line in the debug messages:
> 
>     [   20.677639] TESTDBG: ext4_page_mkwrite() :: returning; all buffers mapped for inode ffff8f72eda5fca8
> 
> That returns early, *before* do_journal_get_write_access(), which is
> needed in data=journal, regardless of whether all buffers are mapped: we
> could exit ext4_page_mkwrite() then change buffers under commit during
> the race window.
>     
> So, it seems we should ignore that for data=journal, which allows us to
> ext4_journal_start() and get a transaction handle, used to add the inode
> to the transaction's inode list (per #1.)

Yes, correct.

> With the changes from 1) and 2), the inode is added to the transaction's
> inode list, and the callback does write-protect the page correctly.
> 
> When trying to change the buffer contents later (pressing enter) there
> _is_ a wait in ext4_page_mkwrite() as expected!  But it's first at the
> file_update_time() because it calls do_get_write_access() for that inode,
> which is also in the transaction.
> 
> I presume that a commit can start after file_update_time() and before
> ext4_journal_start() in ext4_page_mkwrite(), so we'd still have to keep
> these changes to ext4_page_mkwrite() ?

Yes, relying on file_update_time() waiting is definitely too fragile.

> 3) When checking to redirty the page in the writepage callback,
>    does a buffer without a journal head means we should redirty
>    the page? (for the reason it's not part of the committing txn)

This is going to change after the simplifications I've suggested so there's
no need to worry about this now.

> 4) Should we clear the PageChecked bit?
> 
> We don't clear the PageChecked bit, thus later when the writeback
> work kicks in, __ext4_journalled_writepage() adds the page to the
> transaction's inode list again.
> 
> Since the page is clean, it just goes into the submit data buffers
> callback, but write_cache_pages() returns before the writepage callback.
> 
> Should we try to provent that by, say, clearing the pagechecked bit
> in case we don't have to redirty the page (in the writepage callback) ?

I guess it's difficult to determine when PageChecked bit can be safely
cleared. And I plan to do away with PageChecked bit usage completely in my
cleanup so this should go away anyway...

> 5) Inodes with inline data.  I haven't checked in detail yet, but
> would appreciate if you have a brief explanation about them, and
> if they need special handling since they apparently don't get
> do_journal_get_write_access() called (eg, see __ext4_journalled_writepage()).

I think I've commented about this in a particular patch.

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback
  2020-08-19  8:44   ` Jan Kara
@ 2020-08-19 10:41     ` Jan Kara
  2020-08-20 22:55       ` Mauricio Faria de Oliveira
  0 siblings, 1 reply; 21+ messages in thread
From: Jan Kara @ 2020-08-19 10:41 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira
  Cc: Jan Kara, linux-ext4, dann frazier, Mauricio Faria de Oliveira, Jan Kara

On Wed 19-08-20 10:44:21, Jan Kara wrote:
> I was thinking about this and we may need to do this somewhat differently.
> I've realized that there's the slight trouble that we now use page dirty
> bit for two purposes in data=journal mode - to track pages that need write
> protection during commit and also to track pages which have buffers that
> need checkpointing. And this mixing is making things complex. So I was
> thinking that we could simply leave PageDirty bit for checkpointing
> purposes and always make sure buffers are appropriately attached to a
> transaction as dirty in ext4_page_mkwrite(). This will make mmap writes in
> data=journal mode somewhat less efficient (all the pages written through
> mmap while transaction T was running will be written to the journal during
> transaction T commit while currently, we write only pages that also went
> through __ext4_journalled_writepage() while T was running which usually
> happens less frequently). But the code should be simpler and we don't care
> about mmap write performance for data=journal mode much. Furthermore I
> don't think that the tricks with PageChecked logic we play in data=journal
> mode are really needed as well which should bring further simplifications.
> I'll try to code this cleanup.

I was looking more into this but it isn't as simple as I thought because
get_user_pages() users can still modify data and call set_page_dirty() when
the page is no longer writeably mapped. And by the time set_page_dirty() is
called page buffers are not necessarily part of any transaction so we need
to do effectively what's in ext4_journalled_writepage(). To handle this
corner case I didn't find anything considerably simpler than the current
code.

So let's stay with what we have in
ext4_journalled_submit_inode_data_buffers(), we just have to also redirty
the page if we find any dirty buffers.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback
  2020-08-19 10:41     ` Jan Kara
@ 2020-08-20 22:55       ` Mauricio Faria de Oliveira
  2020-08-21 10:26         ` Jan Kara
  0 siblings, 1 reply; 21+ messages in thread
From: Mauricio Faria de Oliveira @ 2020-08-20 22:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-ext4, dann frazier, Jan Kara

Hi Jan,

Thanks a bunch for the detailed comments, including in the cover letter.
Your attention and patience for explanations are really appreciated.

I _think_ I got most of it for the next iteration -- a few follow up questions:

On Wed, Aug 19, 2020 at 7:41 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 19-08-20 10:44:21, Jan Kara wrote:
> > I was thinking about this and we may need to do this somewhat differently.
> > I've realized that there's the slight trouble that we now use page dirty
> > bit for two purposes in data=journal mode - to track pages that need write
> > protection during commit and also to track pages which have buffers that
> > need checkpointing. And this mixing is making things complex. So I was
> > thinking that we could simply leave PageDirty bit for checkpointing
> > purposes and always make sure buffers are appropriately attached to a
> > transaction as dirty in ext4_page_mkwrite(). [snip]
> > [snip] Furthermore I
> > don't think that the tricks with PageChecked logic we play in data=journal
> > mode are really needed as well which should bring further simplifications.
> > I'll try to code this cleanup.
>
> I was looking more into this but it isn't as simple as I thought because
> get_user_pages() users can still modify data and call set_page_dirty() when
> the page is no longer writeably mapped. And by the time set_page_dirty() is
> called page buffers are not necessarily part of any transaction so we need
> to do effectively what's in ext4_journalled_writepage(). To handle this
> corner case I didn't find anything considerably simpler than the current
> code.
>
> So let's stay with what we have in
> ext4_journalled_submit_inode_data_buffers(), we just have to also redirty
> the page if we find any dirty buffers.
>

Could you please clarify/comment whether the dirty buffers "flags" are
different between
the suggestions for ext4_page_mkwrite() and
ext4_journalled_submit_inode_data_buffers() ?

I'm asking because..

In ext4_page_mkwrite() the suggestion is to attach buffers as dirty to
a transaction, which I guess can be done with
ext4_walk_page_buffers(..., write_end_fn) after
ext4_walk_page_buffers(..., do_journal_get_write_access) -- just as
done in ext4_journalled_writepage() -- and that sets the buffer as
*jbd* dirty (BH_JBDDirty.)

In ext4_journalled_submit_inode_data_buffers() the suggestion is to
check for dirty buffers to redirty the page
(for the case of buffers that need checkpointing) and I think this is
the non-jbd/just dirty (BH_Dirty.)

If I actually understood your explanation/suggest, the dirty buffer
flags should be different,
as otherwise we'd be unconditionally setting buffers dirty on
ext4_page_mkwrite() to later
check for (known to be) dirty buffers in
ext4_journalled_submit_inode_data_buffers().

...

And as you mentioned no cleanup / keeping ext4_journalled_writepage()
and the PageChecked bit,
I would like to revisit two questions from the cover letter that would
have no impact with the cleanup,
so to confirm my understanding for the next steps.

    > 3) When checking to redirty the page in the writepage callback,
    >    does a buffer without a journal head means we should redirty
    >    the page? (for the reason it's not part of the committing txn)

Per your explanation about the page dirty bit for buffers that need
checkpointing, I see we cannot redirty
the page just because a buffer isn't part of the transaction -- the
buffer has to be dirty -- so I think it falls
down to your suggestion of 'also redirty if we find any dirty buffers'
(regardless of a buffer w/out txns.) right?

    > 4) Should we clear the PageChecked bit?
    ...
    > Should we try to prevent that [ext4_journalled_writepage()
running later] by, say, clearing the pagechecked bit
    > in case we don't have to redirty the page (in the writepage callback) ?

And I think the answer is no, per your explanation about page dirty
being set elsewhere outside of our control,
and thus ext4_journalled_page() still needs to run, and thus the  page
checked bit still needs to remain set; correct?

Thanks again,




>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



--
Mauricio Faria de Oliveira

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback
  2020-08-20 22:55       ` Mauricio Faria de Oliveira
@ 2020-08-21 10:26         ` Jan Kara
  0 siblings, 0 replies; 21+ messages in thread
From: Jan Kara @ 2020-08-21 10:26 UTC (permalink / raw)
  To: Mauricio Faria de Oliveira; +Cc: Jan Kara, linux-ext4, dann frazier, Jan Kara

On Thu 20-08-20 19:55:05, Mauricio Faria de Oliveira wrote:
> On Wed, Aug 19, 2020 at 7:41 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Wed 19-08-20 10:44:21, Jan Kara wrote:
> > > I was thinking about this and we may need to do this somewhat differently.
> > > I've realized that there's the slight trouble that we now use page dirty
> > > bit for two purposes in data=journal mode - to track pages that need write
> > > protection during commit and also to track pages which have buffers that
> > > need checkpointing. And this mixing is making things complex. So I was
> > > thinking that we could simply leave PageDirty bit for checkpointing
> > > purposes and always make sure buffers are appropriately attached to a
> > > transaction as dirty in ext4_page_mkwrite(). [snip]
> > > [snip] Furthermore I
> > > don't think that the tricks with PageChecked logic we play in data=journal
> > > mode are really needed as well which should bring further simplifications.
> > > I'll try to code this cleanup.
> >
> > I was looking more into this but it isn't as simple as I thought because
> > get_user_pages() users can still modify data and call set_page_dirty() when
> > the page is no longer writeably mapped. And by the time set_page_dirty() is
> > called page buffers are not necessarily part of any transaction so we need
> > to do effectively what's in ext4_journalled_writepage(). To handle this
> > corner case I didn't find anything considerably simpler than the current
> > code.
> >
> > So let's stay with what we have in
> > ext4_journalled_submit_inode_data_buffers(), we just have to also redirty
> > the page if we find any dirty buffers.
> >
> 
> Could you please clarify/comment whether the dirty buffers "flags" are
> different between the suggestions for ext4_page_mkwrite() and
> ext4_journalled_submit_inode_data_buffers() ?
> 
> I'm asking because..
> 
> In ext4_page_mkwrite() the suggestion is to attach buffers as dirty to
> a transaction, which I guess can be done with
> ext4_walk_page_buffers(..., write_end_fn) after
> ext4_walk_page_buffers(..., do_journal_get_write_access) -- just as
> done in ext4_journalled_writepage() -- and that sets the buffer as
> *jbd* dirty (BH_JBDDirty.)

Correct.

> In ext4_journalled_submit_inode_data_buffers() the suggestion is to
> check for dirty buffers to redirty the page
> (for the case of buffers that need checkpointing) and I think this is
> the non-jbd/just dirty (BH_Dirty.)

Again correct :).

> If I actually understood your explanation/suggest, the dirty buffer
> flags should be different,
> as otherwise we'd be unconditionally setting buffers dirty on
> ext4_page_mkwrite() to later
> check for (known to be) dirty buffers in
> ext4_journalled_submit_inode_data_buffers().
> 
> ...
> 
> And as you mentioned no cleanup / keeping ext4_journalled_writepage()
> and the PageChecked bit,
> I would like to revisit two questions from the cover letter that would
> have no impact with the cleanup,
> so to confirm my understanding for the next steps.
> 
>     > 3) When checking to redirty the page in the writepage callback,
>     >    does a buffer without a journal head means we should redirty
>     >    the page? (for the reason it's not part of the committing txn)
> 
> Per your explanation about the page dirty bit for buffers that need
> checkpointing, I see we cannot redirty
> the page just because a buffer isn't part of the transaction -- the
> buffer has to be dirty -- so I think it falls
> down to your suggestion of 'also redirty if we find any dirty buffers'
> (regardless of a buffer w/out txns.) right?

Correct. It should be:

		if (buffer_dirty(bh) || (jh && ...))
			redirty
 
>     > 4) Should we clear the PageChecked bit?
>     ...
>     > Should we try to prevent that [ext4_journalled_writepage()
> running later] by, say, clearing the pagechecked bit
>     > in case we don't have to redirty the page (in the writepage callback) ?
> 
> And I think the answer is no, per your explanation about page dirty
> being set elsewhere outside of our control,
> and thus ext4_journalled_page() still needs to run, and thus the  page
> checked bit still needs to remain set; correct?

Correct.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2020-08-21 10:26 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-10  1:02 [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Mauricio Faria de Oliveira
2020-08-10  1:02 ` [RFC PATCH v2 1/5] jbd2: test case for ext4 data=journal/mmap() journal corruption Mauricio Faria de Oliveira
2020-08-18 14:38   ` Jan Kara
2020-08-19  1:15     ` Mauricio Faria de Oliveira
2020-08-10  1:02 ` [RFC PATCH v2 2/5] jbd2: introduce journal callbacks j_submit|finish_inode_data_buffers Mauricio Faria de Oliveira
2020-08-18 14:52   ` Jan Kara
2020-08-19  1:20     ` Mauricio Faria de Oliveira
2020-08-19  9:16       ` Jan Kara
2020-08-10  1:02 ` [RFC PATCH v2 3/5] ext4: data=journal: write-protect pages on submit inode data buffers callback Mauricio Faria de Oliveira
2020-08-19  8:44   ` Jan Kara
2020-08-19 10:41     ` Jan Kara
2020-08-20 22:55       ` Mauricio Faria de Oliveira
2020-08-21 10:26         ` Jan Kara
2020-08-10  1:02 ` [RFC PATCH v2 4/5] ext4: data=journal: add inode to transaction inode list in ext4_page_mkwrite() Mauricio Faria de Oliveira
2020-08-10  1:02 ` [RFC PATCH v2 5/5] ext4/jbd2: debugging messages Mauricio Faria de Oliveira
2020-08-10  1:48   ` kernel test robot
2020-08-10  2:51   ` kernel test robot
2020-08-19  8:46   ` Jan Kara
2020-08-10  1:02 ` [RFC PATCH v2/SETUP SCRIPT] Mauricio Faria de Oliveira
2020-08-10  1:02 ` [RFC PATCH v2/TEST CASE] Mauricio Faria de Oliveira
2020-08-19  9:27 ` [RFC PATCH v2 0/5] ext4/jbd2: data=journal: write-protect pages on transaction commit Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.