linux-ext4.vger.kernel.org archive mirror
* [PATCH v3 00/13] ext4: add fast commit support
@ 2019-10-01  7:40 Harshad Shirwadkar
  2019-10-01  7:40 ` [PATCH v3 01/13] ext4: add handling for extended mount options Harshad Shirwadkar
                   ` (12 more replies)
  0 siblings, 13 replies; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch series adds support for fast commits, a simplified
version of the scheme proposed by Park and Shin in their paper
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. JBD2 falls back to a traditional commit only if the file
system cannot perform such a commit.
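
The fallback order described above can be sketched as follows. This is
a minimal illustration under stated assumptions, not code from this
series: struct journal here is a simplified stand-in for journal_t,
and fc_commit_cb, force_full, full_commit(), commit() and the toy_fc_*
callbacks are hypothetical names. The callback is assumed to return 0
on success.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified journal; not the real journal_t. */
struct journal {
	/* File-system-provided fast commit hook; returns 0 on success. */
	int (*fc_commit_cb)(struct journal *j);
	bool force_full;        /* a caller asked for a full commit */
	unsigned long nr_fast;  /* fast commits performed */
	unsigned long nr_full;  /* full commits performed */
};

/* Example callbacks: one that always succeeds, one that always fails. */
int toy_fc_ok(struct journal *j)   { (void)j; return 0; }
int toy_fc_fail(struct journal *j) { (void)j; return -1; }

/* Stand-in for the traditional block-level commit. */
void full_commit(struct journal *j)
{
	j->nr_full++;
}

/*
 * Try the file system's fast commit first. Fall back to a full commit
 * if a full commit was requested, no callback is registered, or the
 * callback fails. Returns true if a fast commit was performed.
 */
bool commit(struct journal *j)
{
	assert(j != NULL);
	if (!j->force_full && j->fc_commit_cb && j->fc_commit_cb(j) == 0) {
		j->nr_fast++;
		return true;
	}
	full_commit(j);
	j->force_full = false;  /* the full-commit request has been served */
	return false;
}
```

A caller would register a callback in fc_commit_cb and call commit();
the return value tells it which kind of commit actually ran.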

Because JBD2 operates at block granularity, for every file system
metadata update it writes all the changed blocks to the journal at
commit time. This is inefficient because updates to some of the
blocks that JBD2 commits are derivable from other blocks. For
example, if a new extent is added to an inode, then the corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived from just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase the performance of file system commits significantly.
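
To make the "derivable" relationship concrete, here is a toy sketch of
replaying an extent allocation. It assumes a single block group of 64
blocks and uses hypothetical names (struct toy_fs,
toy_replay_add_extent()); it is not the on-disk ext4 format and it
leaves out the inode table update. The point is only that the bitmap
and both free-block counts follow from the extent alone.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLOCKS 64   /* blocks in the single toy block group */

/* Hypothetical toy model of metadata touched by an extent allocation. */
struct toy_fs {
	uint64_t block_bitmap;  /* bit i set => block i is in use */
	unsigned int gd_free;   /* group descriptor free-block count */
	unsigned int sb_free;   /* superblock free-block count */
};

/*
 * Replay "extent [start, start + len) was added to an inode" by
 * rederiving the bitmap and both free counts, instead of reading
 * journalled copies of those blocks.
 */
void toy_replay_add_extent(struct toy_fs *fs, unsigned int start,
			   unsigned int len)
{
	assert(len > 0 && start < TOY_BLOCKS && len <= TOY_BLOCKS - start);
	for (unsigned int b = start; b < start + len; b++) {
		uint64_t bit = UINT64_C(1) << b;

		if (fs->block_bitmap & bit)
			continue;  /* already allocated: replay is idempotent */
		fs->block_bitmap |= bit;
		fs->gd_free--;
		fs->sb_free--;
	}
}
```

Skipping blocks whose bit is already set means replaying the same
record twice leaves the counts unchanged, which is a useful property
for recovery code that may be interrupted and restarted.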

The fast commits introduced in this patch series make two main
contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits

(2) Adding support in ext4 to use JBD2's new interfaces and implement
    fast commits

Testing
-------

e2fsprogs was updated to set the fast commit feature flag and to
ignore fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, the following xfstests
runs were performed:

- kvm-xfstests.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful and smoke tests didn't introduce any
additional failures.

Performance Evaluation
----------------------

Ext4 file system performance was tested with and without fast commit
using the fs_mark benchmark. The following command was used:

Command: ./fs_mark -t 8 -n 1024 -s 65536 -w 4096 -d /mnt

Results:
Without Fast Commit: 1501.2 files/sec
With Fast Commit:    3055 files/sec
~103% write performance improvement
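
The improvement figure follows from the two throughput numbers above.
As a small worked check (improvement_pct() is a hypothetical helper
written for this illustration, not part of fs_mark):

```c
#include <assert.h>

/* Relative improvement of `after` over `before`, in percent. */
double improvement_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

Calling improvement_pct(1501.2, 3055.0) gives (3055 - 1501.2) / 1501.2
* 100, which is roughly 103.5.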

Changes since V2:

- Added the ability to support new file creation in fast commits. This
  allows us to use the fs_mark benchmark for performance testing

- Added support for asynchronous fast commits

- Many cleanups and bug fixes

- Re-organized the patch set, moved most of the new code to
  ext4_jbd2.c instead of super.c

- Addressed review comments on the previous patch set

Harshad Shirwadkar (13):
 docs: Add fast commit documentation
 ext4: add support for asynchronous fast commits
 ext4: fast-commit recovery path changes
 ext4: fast-commit commit path changes
 ext4: fast-commit commit range tracking
 ext4: track changed files for fast commit
 ext4: add fields that are needed to track changed files
 jbd2: fast-commit recovery path changes
 jbd2: fast-commit commit path new APIs
 jbd2: fast-commit commit path changes
 jbd2: fast commit setup and enable
 ext4: add handling for extended mount options
 ext4: add fast commit support

 Documentation/filesystems/ext4/journal.rst |   98 +-
 Documentation/filesystems/journalling.rst  |   22
 fs/ext4/acl.c                              |    1
 fs/ext4/balloc.c                           |    7
 fs/ext4/ext4.h                             |   86 +
 fs/ext4/ext4_jbd2.c                        |  902 +++++++++++++++++++
 fs/ext4/ext4_jbd2.h                        |   98 ++
 fs/ext4/extents.c                          |   43
 fs/ext4/fsync.c                            |    7
 fs/ext4/ialloc.c                           |   60 -
 fs/ext4/inline.c                           |   14
 fs/ext4/inode.c                            |   77 +
 fs/ext4/ioctl.c                            |    9
 fs/ext4/mballoc.c                          |   83 +
 fs/ext4/mballoc.h                          |    2
 fs/ext4/migrate.c                          |    1
 fs/ext4/namei.c                            |   16
 fs/ext4/super.c                            |   55 +
 fs/ext4/xattr.c                            |    1
 fs/jbd2/commit.c                           |   98 ++
 fs/jbd2/journal.c                          |  343 ++++++-
 fs/jbd2/recovery.c                         |   63 +
 fs/jbd2/transaction.c                      |    3
 include/linux/jbd2.h                       |  117 ++
 include/trace/events/ext4.h                |   61 +
 include/trace/events/jbd2.h                |    9
 26 files changed, 2170 insertions(+), 106 deletions(-)
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 01/13] ext4: add handling for extended mount options
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16  2:14   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 02/13] jbd2: fast commit setup and enable Harshad Shirwadkar
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC
  To: linux-ext4; +Cc: Harshad Shirwadkar, Andreas Dilger, Theodore Ts'o

We are running out of mount option bits. This patch adds handling for
using s_mount_opt2 and also adds the ability to turn the fast commit
feature on and off. In order to use fast commits, a new version of
e2fsprogs needs to set the fast commit feature flag. This also makes
sure that we have fast commit compatible e2fsprogs before starting to
use the feature. The mount flag "no_fastcommit", introduced in this
patch, can be passed to disable the feature at mount time.
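
The idea of routing an option to a second bit word can be sketched in
isolation. This is a simplified illustration, not the kernel code:
struct toy_sbi, struct toy_mount_opt and toy_apply_opt() are
hypothetical, and the MOPT_SET / MOPT_CLEAR values are chosen for the
example. Only the MOPT_2 value and the 0x10 fast commit bit are taken
from this patch.

```c
#include <assert.h>

#define MOPT_SET   0x0001  /* option sets its bit (example value) */
#define MOPT_CLEAR 0x0002  /* option clears its bit (example value) */
#define MOPT_2     0x0800  /* bit lives in the second option word */

#define TOY_MOUNT2_FAST_COMMIT 0x00000010

/* Hypothetical stand-in for the two mount option words in the sb info. */
struct toy_sbi {
	unsigned int mount_opt;   /* first word, nearly full */
	unsigned int mount_opt2;  /* second word, selected by MOPT_2 */
};

/* Hypothetical table entry: which bit an option controls, and how. */
struct toy_mount_opt {
	unsigned int mount_opt;   /* bit(s) controlled by the option */
	unsigned int flags;       /* MOPT_* flags */
};

/* Apply one parsed option to the word selected by MOPT_2. */
void toy_apply_opt(struct toy_sbi *sbi, const struct toy_mount_opt *m)
{
	unsigned int *word = (m->flags & MOPT_2) ? &sbi->mount_opt2
						 : &sbi->mount_opt;

	assert((m->flags & (MOPT_SET | MOPT_CLEAR)) != 0);
	if (m->flags & MOPT_SET)
		*word |= m->mount_opt;
	else
		*word &= ~m->mount_opt;
}
```

With an entry of { TOY_MOUNT2_FAST_COMMIT, MOPT_CLEAR | MOPT_2 }, the
call clears the fast commit bit in mount_opt2 and leaves mount_opt
untouched, which is the effect "no_fastcommit" is meant to have.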

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ext4.h       |  4 ++++
 fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
 include/linux/jbd2.h |  5 ++++-
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..becbda38b7db 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1146,6 +1146,8 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
 						specified journal checksum */
 
+#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1643,6 +1645,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
 #define EXT4_FEATURE_COMPAT_RESIZE_INODE	0x0010
 #define EXT4_FEATURE_COMPAT_DIR_INDEX		0x0020
 #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
+#define EXT4_FEATURE_COMPAT_FAST_COMMIT		0x0400
 
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
@@ -1743,6 +1746,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr,		EXT_ATTR)
 EXT4_FEATURE_COMPAT_FUNCS(resize_inode,		RESIZE_INODE)
 EXT4_FEATURE_COMPAT_FUNCS(dir_index,		DIR_INDEX)
 EXT4_FEATURE_COMPAT_FUNCS(sparse_super2,	SPARSE_SUPER2)
+EXT4_FEATURE_COMPAT_FUNCS(fast_commit,		FAST_COMMIT)
 
 EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super,	SPARSE_SUPER)
 EXT4_FEATURE_RO_COMPAT_FUNCS(large_file,	LARGE_FILE)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4079605d437a..e376ac040cce 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1455,6 +1455,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
+	Opt_no_fastcommit
 };
 
 static const match_table_t tokens = {
@@ -1537,6 +1538,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable=%u"},
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
+	{Opt_no_fastcommit, "no_fastcommit"},
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
 	{Opt_nombcache, "nombcache"},
@@ -1659,6 +1661,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_NO_EXT3	0x0200
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
+#define MOPT_2		0x0800
 
 static const struct mount_opts {
 	int	token;
@@ -1751,6 +1754,8 @@ static const struct mount_opts {
 	{Opt_max_dir_size_kb, 0, MOPT_GTE0},
 	{Opt_test_dummy_encryption, 0, MOPT_GTE0},
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+	{Opt_no_fastcommit, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
 	{Opt_err, 0, 0}
 };
 
@@ -1858,8 +1863,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			set_opt2(sb, EXPLICIT_DELALLOC);
 		} else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
 			set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
-		} else
+		} else if (m->mount_opt) {
 			return -1;
+		}
 	}
 	if (m->flags & MOPT_CLEAR_ERR)
 		clear_opt(sb, ERRORS_MASK);
@@ -2027,10 +2033,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
-			sbi->s_mount_opt |= m->mount_opt;
-		else
-			sbi->s_mount_opt &= ~m->mount_opt;
+		if (m->flags & MOPT_2) {
+			if (arg != 0)
+				sbi->s_mount_opt2 |= m->mount_opt;
+			else
+				sbi->s_mount_opt2 &= ~m->mount_opt;
+		} else {
+			if (arg != 0)
+				sbi->s_mount_opt |= m->mount_opt;
+			else
+				sbi->s_mount_opt &= ~m->mount_opt;
+		}
 	}
 	return 1;
 }
@@ -3733,6 +3746,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
 	set_opt(sb, POSIX_ACL);
 #endif
+	if (ext4_has_feature_fast_commit(sb))
+		set_opt2(sb, JOURNAL_FAST_COMMIT);
+
 	/* don't forget to enable journal_csum when metadata_csum is enabled. */
 	if (ext4_has_metadata_csum(sb))
 		set_opt(sb, JOURNAL_CHECKSUM);
@@ -4334,6 +4350,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
 		clear_opt(sb, JOURNAL_CHECKSUM);
 		clear_opt(sb, DATA_FLAGS);
+		clear_opt2(sb, JOURNAL_FAST_COMMIT);
 		sbi->s_journal = NULL;
 		needs_recovery = 0;
 		goto no_journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index df03825ad1a1..b7eed49b8ecd 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -288,6 +288,7 @@ typedef struct journal_superblock_s
 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT	0x00000004
 #define JBD2_FEATURE_INCOMPAT_CSUM_V2		0x00000008
 #define JBD2_FEATURE_INCOMPAT_CSUM_V3		0x00000010
+#define JBD2_FEATURE_INCOMPAT_FAST_COMMIT	0x00000020
 
 /* See "journal feature predicate functions" below */
 
@@ -298,7 +299,8 @@ typedef struct journal_superblock_s
 					JBD2_FEATURE_INCOMPAT_64BIT | \
 					JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
 					JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
-					JBD2_FEATURE_INCOMPAT_CSUM_V3)
+					JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
+					JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
 
 #ifdef __KERNEL__
 
@@ -1235,6 +1237,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(64bit,		64BIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(async_commit,	ASYNC_COMMIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum2,		CSUM_V2)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum3,		CSUM_V3)
+JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
 
 /*
  * Journal flag definitions
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 02/13] jbd2: fast commit setup and enable
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
  2019-10-01  7:40 ` [PATCH v3 01/13] ext4: add handling for extended mount options Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 13:03   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 03/13] jbd2: fast-commit commit path changes Harshad Shirwadkar
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch allows file systems to turn fast commits on, which
restricts the normal journalling space to the total number of journal
blocks minus JBD2_FAST_COMMIT_BLOCKS. Fast commits are not actually
performed yet; this patch only opens the interface for turning them on.
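
The split of the journal into a normal area and a fast commit area can
be sketched as a small pure function. This mirrors the arithmetic in
journal_reset() in this patch, but struct toy_layout and
toy_journal_layout() are hypothetical names used only for
illustration, and TOY_FC_BLOCKS stands in for JBD2_FAST_COMMIT_BLOCKS.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FC_BLOCKS 128  /* stands in for JBD2_FAST_COMMIT_BLOCKS */

/* Hypothetical summary of the journal area boundaries. */
struct toy_layout {
	unsigned long first;     /* first block of the normal log area */
	unsigned long last;      /* end of the normal log area */
	unsigned long first_fc;  /* start of the fast commit area */
	unsigned long last_fc;   /* end of the fast commit area */
	unsigned long free;      /* blocks available to normal commits */
};

/*
 * Carve TOY_FC_BLOCKS off the tail of the journal when fast commits
 * are enabled; otherwise the whole range is the normal log area.
 */
struct toy_layout toy_journal_layout(unsigned long first,
				     unsigned long last, bool fast_commit)
{
	struct toy_layout l = { .first = first };

	if (fast_commit) {
		assert(last > first && last - first > TOY_FC_BLOCKS);
		l.last_fc = last;
		l.last = last - TOY_FC_BLOCKS;
		l.first_fc = l.last + 1;
	} else {
		l.last = last;
	}
	l.free = l.last - l.first;
	return l;
}
```

For first = 1 and last = 1024 with fast commits enabled, this yields
last = 896, first_fc = 897, last_fc = 1024 and free = 895; with fast
commits disabled, free is 1023.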

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/super.c      |  5 +++-
 fs/jbd2/journal.c    | 68 +++++++++++++++++++++++++++++++++-----------
 include/linux/jbd2.h | 39 +++++++++++++++++++++++++
 3 files changed, 95 insertions(+), 17 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e376ac040cce..7725eb2105f4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4933,7 +4933,10 @@ static int ext4_load_journal(struct super_block *sb,
 		if (save)
 			memcpy(save, ((char *) es) +
 			       EXT4_S_ERR_START, EXT4_S_ERR_LEN);
-		err = jbd2_journal_load(journal);
+		if (test_opt2(sb, JOURNAL_FAST_COMMIT))
+			err = jbd2_journal_load_with_fc(journal);
+		else
+			err = jbd2_journal_load(journal);
 		if (save)
 			memcpy(((char *) es) + EXT4_S_ERR_START,
 			       save, EXT4_S_ERR_LEN);
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 953990eb70a9..7c13834873ad 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1159,12 +1159,15 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	journal->j_blk_offset = start;
 	journal->j_maxlen = len;
 	n = journal->j_blocksize / sizeof(journal_block_tag_t);
-	journal->j_wbufsize = n;
+	journal->j_wbufsize = n - JBD2_FAST_COMMIT_BLOCKS;
 	journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
 					GFP_KERNEL);
 	if (!journal->j_wbuf)
 		goto err_cleanup;
 
+	journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
+	journal->j_fc_wbufsize = JBD2_FAST_COMMIT_BLOCKS;
+
 	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
 	if (!bh) {
 		pr_err("%s: Cannot get buffer for journal superblock\n",
@@ -1297,11 +1300,19 @@ static int journal_reset(journal_t *journal)
 	}
 
 	journal->j_first = first;
-	journal->j_last = last;
 
-	journal->j_head = first;
-	journal->j_tail = first;
-	journal->j_free = last - first;
+	if (jbd2_has_feature_fast_commit(journal)) {
+		journal->j_last_fc = last;
+		journal->j_last = last - JBD2_FAST_COMMIT_BLOCKS;
+		journal->j_first_fc = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = last;
+	}
+
+	journal->j_head = journal->j_first;
+	journal->j_tail = journal->j_first;
+	journal->j_free = journal->j_last - journal->j_first;
 
 	journal->j_tail_sequence = journal->j_transaction_sequence;
 	journal->j_commit_sequence = journal->j_transaction_sequence - 1;
@@ -1626,22 +1637,21 @@ static int load_superblock(journal_t *journal)
 	journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
 	journal->j_tail = be32_to_cpu(sb->s_start);
 	journal->j_first = be32_to_cpu(sb->s_first);
-	journal->j_last = be32_to_cpu(sb->s_maxlen);
 	journal->j_errno = be32_to_cpu(sb->s_errno);
 
+	if (jbd2_has_feature_fast_commit(journal)) {
+		journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
+		journal->j_last = journal->j_last_fc - JBD2_FAST_COMMIT_BLOCKS;
+		journal->j_first_fc = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = be32_to_cpu(sb->s_maxlen);
+	}
+
 	return 0;
 }
 
-
-/**
- * int jbd2_journal_load() - Read journal from disk.
- * @journal: Journal to act on.
- *
- * Given a journal_t structure which tells us which disk blocks contain
- * a journal, read the journal from disk to initialise the in-memory
- * structures.
- */
-int jbd2_journal_load(journal_t *journal)
+static int __jbd2_journal_load(journal_t *journal, bool enable_fc)
 {
 	int err;
 	journal_superblock_t *sb;
@@ -1684,6 +1694,12 @@ int jbd2_journal_load(journal_t *journal)
 		return -EFSCORRUPTED;
 	}
 
+	if (enable_fc)
+		jbd2_journal_set_features(journal, 0, 0,
+					  JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
+	else
+		jbd2_journal_clear_features(journal, 0, 0,
+					    JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
 	/* OK, we've finished with the dynamic journal bits:
 	 * reinitialise the dynamic contents of the superblock in memory
 	 * and reset them on disk. */
@@ -1699,6 +1715,26 @@ int jbd2_journal_load(journal_t *journal)
 	return -EIO;
 }
 
+/**
+ * int jbd2_journal_load() - Read journal from disk.
+ * @journal: Journal to act on.
+ *
+ * Given a journal_t structure which tells us which disk blocks contain
+ * a journal, read the journal from disk to initialise the in-memory
+ * structures.
+ */
+int jbd2_journal_load(journal_t *journal)
+{
+	return __jbd2_journal_load(journal, false);
+}
+
+/* Same as above but also enables fast commits. */
+int jbd2_journal_load_with_fc(journal_t *journal)
+{
+	return __jbd2_journal_load(journal, true);
+}
+
+
 /**
  * void jbd2_journal_destroy() - Release a journal_t structure.
  * @journal: Journal to act on.
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index b7eed49b8ecd..84d04e1f3d92 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -67,6 +67,7 @@ extern void *jbd2_alloc(size_t size, gfp_t flags);
 extern void jbd2_free(void *ptr, size_t size);
 
 #define JBD2_MIN_JOURNAL_BLOCKS 1024
+#define JBD2_FAST_COMMIT_BLOCKS 128
 
 #ifdef __KERNEL__
 
@@ -918,6 +919,30 @@ struct journal_s
 	 */
 	unsigned long		j_last;
 
+	/**
+	 * @j_first_fc:
+	 *
+	 * The block number of the first fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_first_fc;
+
+	/**
+	 * @j_fc_off:
+	 *
+	 * Number of fast commit blocks currently allocated.
+	 * [j_state_lock].
+	 */
+	unsigned long		j_fc_off;
+
+	/**
+	 * @j_last_fc:
+	 *
+	 * The block number one beyond the last fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_last_fc;
+
 	/**
 	 * @j_dev: Device where we store the journal.
 	 */
@@ -1061,6 +1086,12 @@ struct journal_s
 	 */
 	struct buffer_head	**j_wbuf;
 
+	/**
+	 * @j_fc_wbuf: Array of fast commit bhs for
+	 * jbd2_journal_commit_transaction.
+	 */
+	struct buffer_head	**j_fc_wbuf;
+
 	/**
 	 * @j_wbufsize:
 	 *
@@ -1068,6 +1099,13 @@ struct journal_s
 	 */
 	int			j_wbufsize;
 
+	/**
+	 * @j_fc_wbufsize:
+	 *
+	 * Size of @j_fc_wbuf array.
+	 */
+	int			j_fc_wbufsize;
+
 	/**
 	 * @j_last_sync_writer:
 	 *
@@ -1398,6 +1436,7 @@ extern int	   jbd2_journal_set_features
 extern void	   jbd2_journal_clear_features
 		   (journal_t *, unsigned long, unsigned long, unsigned long);
 extern int	   jbd2_journal_load       (journal_t *journal);
+extern int	   jbd2_journal_load_with_fc(journal_t *journal);
 extern int	   jbd2_journal_destroy    (journal_t *);
 extern int	   jbd2_journal_recover    (journal_t *journal);
 extern int	   jbd2_journal_wipe       (journal_t *, int);
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 03/13] jbd2: fast-commit commit path changes
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
  2019-10-01  7:40 ` [PATCH v3 01/13] ext4: add handling for extended mount options Harshad Shirwadkar
  2019-10-01  7:40 ` [PATCH v3 02/13] jbd2: fast commit setup and enable Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 16:38   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 04/13] jbd2: fast-commit commit path new APIs Harshad Shirwadkar
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch adds the core fast-commit commit path changes. It also
modifies existing JBD2 APIs to allow usage of fast commits. If fast
commits are enabled and journal->j_do_full_commit is not set, the
commit routine tries the file system specific fast commit first. It
falls back to the full commit only if the fast commit fails. The commit
start and wait routines have their own variants that support fast commits.
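
The condition under which a waiter may stop waiting can be written as
a standalone predicate. This is a sketch of the logic in the wait loop
of this patch, with hypothetical names (toy_tid_gt(),
toy_commit_done()); toy_tid_gt() follows the wraparound-safe
comparison that jbd2's tid_gt() performs.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int tid_t;

/* Wraparound-safe "x is newer than y", in the style of jbd2's tid_gt(). */
bool toy_tid_gt(tid_t x, tid_t y)
{
	int difference;

	/* The signed-difference trick relies on tid_t being int-sized. */
	assert(sizeof(tid_t) == sizeof(int));
	difference = (int)(x - y);
	return difference > 0;
}

/*
 * True once a waiter for (tid, subtid) may stop waiting: either the
 * full commit of tid has completed, or no full commit was forced and
 * the fast commit numbered subtid has completed.
 */
bool toy_commit_done(tid_t tid, tid_t subtid, tid_t commit_seq,
		     tid_t fc_seq, bool do_full_commit)
{
	if (!toy_tid_gt(tid, commit_seq))
		return true;
	return !do_full_commit && !toy_tid_gt(subtid, fc_seq);
}
```

When do_full_commit is set, only the full commit sequence number can
satisfy the waiter, so a completed fast commit is ignored.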

In this patch we also add a new entry to journal->stats which counts
the number of fast commits performed.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/jbd2/commit.c            | 55 ++++++++++++++++++++--
 fs/jbd2/journal.c           | 94 ++++++++++++++++++++++++++++++++-----
 fs/jbd2/transaction.c       |  1 +
 include/linux/jbd2.h        | 42 ++++++++++++++++-
 include/trace/events/jbd2.h |  9 ++--
 5 files changed, 182 insertions(+), 19 deletions(-)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 132fb92098c7..7db3e2b6336d 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -351,8 +351,12 @@ static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
  *
  * The primary function for committing a transaction to the log.  This
  * function is called by the journal thread to begin a complete commit.
+ *
+ * fc is an input / output parameter. If fc is non-NULL and *fc is true, this
+ * function tries to perform a fast commit. On return, *fc is true if a fast
+ * commit was performed and false if a full commit was performed instead.
  */
-void jbd2_journal_commit_transaction(journal_t *journal)
+void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
 {
 	struct transaction_stats_s stats;
 	transaction_t *commit_transaction;
@@ -380,6 +384,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	tid_t first_tid;
 	int update_tail;
 	int csum_size = 0;
+	bool full_commit;
 	LIST_HEAD(io_bufs);
 	LIST_HEAD(log_bufs);
 
@@ -413,6 +418,44 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	J_ASSERT(journal->j_running_transaction != NULL);
 	J_ASSERT(journal->j_committing_transaction == NULL);
 
+	write_lock(&journal->j_state_lock);
+	full_commit = journal->j_do_full_commit;
+	write_unlock(&journal->j_state_lock);
+
+	/* Let file-system try its own fast commit */
+	if (jbd2_has_feature_fast_commit(journal)) {
+		if (!full_commit && fc && *fc == true &&
+		    journal->j_fc_commit_callback &&
+		    !journal->j_fc_commit_callback(
+			journal, journal->j_running_transaction->t_tid,
+			journal->j_running_transaction->t_subtid, &stats.run)) {
+			jbd_debug(3, "fast commit success.\n");
+			if (journal->j_fc_cleanup_callback)
+				journal->j_fc_cleanup_callback(journal);
+			write_lock(&journal->j_state_lock);
+			journal->j_fc_sequence = journal->j_running_transaction
+						 ->t_subtid;
+			journal->j_running_transaction->t_subtid++;
+			if (fc)
+				*fc = true;
+			write_unlock(&journal->j_state_lock);
+			trace_jbd2_run_stats(journal->j_fs_dev->bd_dev,
+					     journal->j_running_transaction
+					     ->t_tid,
+					     &stats.run, true);
+			goto update_overall_stats;
+		}
+		if (journal->j_fc_cleanup_callback)
+			journal->j_fc_cleanup_callback(journal);
+		write_lock(&journal->j_state_lock);
+		journal->j_do_full_commit = false;
+		write_unlock(&journal->j_state_lock);
+	}
+
+	jbd_debug(3, "fast commit not performed, trying full.\n");
+	if (fc)
+		*fc = false;
+
 	commit_transaction = journal->j_running_transaction;
 
 	trace_jbd2_start_commit(journal, commit_transaction);
@@ -420,6 +463,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 			commit_transaction->t_tid);
 
 	write_lock(&journal->j_state_lock);
+	journal->j_fc_off = 0;
 	J_ASSERT(commit_transaction->t_state == T_RUNNING);
 	commit_transaction->t_state = T_LOCKED;
 
@@ -1085,12 +1129,13 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	stats.run.rs_handle_count =
 		atomic_read(&commit_transaction->t_handle_count);
 	trace_jbd2_run_stats(journal->j_fs_dev->bd_dev,
-			     commit_transaction->t_tid, &stats.run);
+			     commit_transaction->t_tid, &stats.run, false);
 	stats.ts_requested = (commit_transaction->t_requested) ? 1 : 0;
 
 	commit_transaction->t_state = T_COMMIT_CALLBACK;
 	J_ASSERT(commit_transaction == journal->j_committing_transaction);
 	journal->j_commit_sequence = commit_transaction->t_tid;
+	journal->j_fc_sequence = 0;
 	journal->j_committing_transaction = NULL;
 	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
 
@@ -1129,8 +1174,12 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	/*
 	 * Calculate overall stats
 	 */
+update_overall_stats:
 	spin_lock(&journal->j_history_lock);
-	journal->j_stats.ts_tid++;
+	if (fc && *fc == true)
+		journal->j_stats.ts_num_fast_commits++;
+	else
+		journal->j_stats.ts_tid++;
 	journal->j_stats.ts_requested += stats.ts_requested;
 	journal->j_stats.run.rs_wait += stats.run.rs_wait;
 	journal->j_stats.run.rs_request_delay += stats.run.rs_request_delay;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7c13834873ad..6853064605ff 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -160,7 +160,13 @@ static void commit_timeout(struct timer_list *t)
  *
  * 1) COMMIT:  Every so often we need to commit the current state of the
  *    filesystem to disk.  The journal thread is responsible for writing
- *    all of the metadata buffers to disk.
+ *    all of the metadata buffers to disk. If fast commits are allowed,
+ *    journal thread passes the control to the file system and file system
+ *    is then responsible for writing metadata buffers to disk (in whichever
+ *    format it wants). If fast commit succeeds, journal thread won't perform
+ *    a normal commit. In case the fast commit fails, journal thread performs
+ *    full commit as normal.
+ *
  *
  * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
  *    of the data in that part of the log has been rewritten elsewhere on
@@ -172,6 +178,7 @@ static int kjournald2(void *arg)
 {
 	journal_t *journal = arg;
 	transaction_t *transaction;
+	bool fc_flag = true, fc_flag_save;
 
 	/*
 	 * Set up an interval timer which can be used to trigger a commit wakeup
@@ -209,9 +216,14 @@ static int kjournald2(void *arg)
 		jbd_debug(1, "OK, requests differ\n");
 		write_unlock(&journal->j_state_lock);
 		del_timer_sync(&journal->j_commit_timer);
-		jbd2_journal_commit_transaction(journal);
+		fc_flag_save = fc_flag;
+		jbd2_journal_commit_transaction(journal, &fc_flag);
 		write_lock(&journal->j_state_lock);
-		goto loop;
+		if (!fc_flag) {
+			/* fast commit not performed */
+			fc_flag = fc_flag_save;
+			goto loop;
+		}
 	}
 
 	wake_up(&journal->j_wait_done_commit);
@@ -235,16 +247,18 @@ static int kjournald2(void *arg)
 
 		prepare_to_wait(&journal->j_wait_commit, &wait,
 				TASK_INTERRUPTIBLE);
-		if (journal->j_commit_sequence != journal->j_commit_request)
+		if (!fc_flag &&
+		    journal->j_commit_sequence != journal->j_commit_request)
 			should_sleep = 0;
 		transaction = journal->j_running_transaction;
 		if (transaction && time_after_eq(jiffies,
-						transaction->t_expires))
+						 transaction->t_expires))
 			should_sleep = 0;
 		if (journal->j_flags & JBD2_UNMOUNT)
 			should_sleep = 0;
 		if (should_sleep) {
 			write_unlock(&journal->j_state_lock);
+			jbd_debug(1, "%s sleeps\n", __func__);
 			schedule();
 			write_lock(&journal->j_state_lock);
 		}
@@ -259,7 +273,10 @@ static int kjournald2(void *arg)
 	transaction = journal->j_running_transaction;
 	if (transaction && time_after_eq(jiffies, transaction->t_expires)) {
 		journal->j_commit_request = transaction->t_tid;
+		fc_flag = false;
 		jbd_debug(1, "woke because of timeout\n");
+	} else {
+		fc_flag = true;
 	}
 	goto loop;
 
@@ -522,11 +539,23 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid)
 	int ret;
 
 	write_lock(&journal->j_state_lock);
+	journal->j_do_full_commit = true;
 	ret = __jbd2_log_start_commit(journal, tid);
 	write_unlock(&journal->j_state_lock);
 	return ret;
 }
 
+int jbd2_log_start_commit_fast(journal_t *journal, tid_t tid)
+{
+	int ret;
+
+	write_lock(&journal->j_state_lock);
+	ret = __jbd2_log_start_commit(journal, tid);
+	write_unlock(&journal->j_state_lock);
+
+	return ret;
+}
+
 /*
  * Force and wait any uncommitted transactions.  We can only force the running
  * transaction if we don't have an active handle, otherwise, we will deadlock.
@@ -603,11 +632,15 @@ int jbd2_journal_force_commit(journal_t *journal)
  * if a transaction is going to be committed (or is currently already
  * committing), and fills its tid in at *ptid
  */
-int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+int __jbd2_journal_start_commit(journal_t *journal, tid_t *ptid,
+				bool full_commit)
 {
 	int ret = 0;
 
 	write_lock(&journal->j_state_lock);
+	if (!journal->j_do_full_commit)
+		journal->j_do_full_commit = full_commit;
+
 	if (journal->j_running_transaction) {
 		tid_t tid = journal->j_running_transaction->t_tid;
 
@@ -630,6 +663,16 @@ int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
 	return ret;
 }
 
+int jbd2_journal_start_commit_fast(journal_t *journal, tid_t *ptid)
+{
+	return __jbd2_journal_start_commit(journal, ptid, false);
+}
+
+int jbd2_journal_start_commit(journal_t *journal, tid_t *ptid)
+{
+	return __jbd2_journal_start_commit(journal, ptid, true);
+}
+
 /*
  * Return 1 if a given transaction has not yet sent barrier request
  * connected with a transaction commit. If 0 is returned, transaction
@@ -675,7 +718,7 @@ EXPORT_SYMBOL(jbd2_trans_will_send_data_barrier);
  * Wait for a specified commit to complete.
  * The caller may not hold the journal lock.
  */
-int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
+int __jbd2_log_wait_commit(journal_t *journal, tid_t tid, tid_t subtid)
 {
 	int err = 0;
 
@@ -702,12 +745,27 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	}
 #endif
 	while (tid_gt(tid, journal->j_commit_sequence)) {
-		jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
-				  tid, journal->j_commit_sequence);
+		if ((!journal->j_do_full_commit) &&
+		    !tid_gt(subtid, journal->j_fc_sequence))
+			break;
+		jbd_debug(1, "JBD2: want full commit %u %s %u, ",
+			  tid, journal->j_do_full_commit ?
+			  "and ignoring fast commit request for " :
+			  "or want fast commit",
+			  journal->j_fc_sequence);
+		jbd_debug(1, "j_commit_sequence=%u, j_fc_sequence=%u\n",
+			  journal->j_commit_sequence,
+			  journal->j_fc_sequence);
 		read_unlock(&journal->j_state_lock);
 		wake_up(&journal->j_wait_commit);
-		wait_event(journal->j_wait_done_commit,
-				!tid_gt(tid, journal->j_commit_sequence));
+		if (journal->j_do_full_commit)
+			wait_event(journal->j_wait_done_commit,
+				   !tid_gt(tid, journal->j_commit_sequence));
+		else
+			wait_event(journal->j_wait_done_commit,
+				   !tid_gt(tid, journal->j_commit_sequence) ||
+				   !tid_gt(subtid,
+					    journal->j_fc_sequence));
 		read_lock(&journal->j_state_lock);
 	}
 	read_unlock(&journal->j_state_lock);
@@ -717,6 +775,13 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
+{
+	journal->j_do_full_commit = true;
+	return __jbd2_log_wait_commit(journal, tid, 0);
+}
+
+
 /* Return 1 when transaction with given tid has already committed. */
 int jbd2_transaction_committed(journal_t *journal, tid_t tid)
 {
@@ -996,6 +1061,8 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
 		   "each up to %u blocks\n",
 		   s->stats->ts_tid, s->stats->ts_requested,
 		   s->journal->j_max_transaction_buffers);
+	seq_printf(seq, "%lu fast commits performed\n",
+		   s->stats->ts_num_fast_commits);
 	if (s->stats->ts_tid == 0)
 		return 0;
 	seq_printf(seq, "average: \n  %ums waiting for transaction\n",
@@ -1020,6 +1087,9 @@ static int jbd2_seq_info_show(struct seq_file *seq, void *v)
 	    s->stats->run.rs_blocks / s->stats->ts_tid);
 	seq_printf(seq, "  %lu logged blocks per transaction\n",
 	    s->stats->run.rs_blocks_logged / s->stats->ts_tid);
+	seq_printf(seq, "  %lu logged blocks per commit\n",
+	    s->stats->run.rs_blocks_logged /
+	    (s->stats->ts_tid + s->stats->ts_num_fast_commits));
 	return 0;
 }
 
@@ -1752,7 +1822,7 @@ int jbd2_journal_destroy(journal_t *journal)
 
 	/* Force a final log commit */
 	if (journal->j_running_transaction)
-		jbd2_journal_commit_transaction(journal);
+		jbd2_journal_commit_transaction(journal, NULL);
 
 	/* Force any old transactions to disk */
 
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 990e7b5062e7..ce7f03cfd90b 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -84,6 +84,7 @@ static void jbd2_get_transaction(journal_t *journal,
 	transaction->t_state = T_RUNNING;
 	transaction->t_start_time = ktime_get();
 	transaction->t_tid = journal->j_transaction_sequence++;
+	transaction->t_subtid = 1;
 	transaction->t_expires = jiffies + journal->j_commit_interval;
 	spin_lock_init(&transaction->t_handle_lock);
 	atomic_set(&transaction->t_updates, 0);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 84d04e1f3d92..41315f648c0f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -580,6 +580,9 @@ struct transaction_s
 	/* Sequence number for this transaction [no locking] */
 	tid_t			t_tid;
 
+	/* Sequence number of the current ongoing fast commit [no locking] */
+	tid_t			t_subtid;
+
 	/*
 	 * Transaction's current state
 	 * [no locking - only kjournald2 alters this]
@@ -742,6 +745,7 @@ struct transaction_run_stats_s {
 
 struct transaction_stats_s {
 	unsigned long		ts_tid;
+	unsigned long		ts_num_fast_commits;
 	unsigned long		ts_requested;
 	struct transaction_run_stats_s run;
 };
@@ -943,6 +947,13 @@ struct journal_s
 	 */
 	unsigned long		j_last_fc;
 
+	/**
+	 * @j_do_full_commit:
+	 *
+	 * Force a full commit. If set, JBD2 won't attempt fast commits.
+	 */
+	bool			j_do_full_commit;
+
 	/**
 	 * @j_dev: Device where we store the journal.
 	 */
@@ -1012,6 +1023,14 @@ struct journal_s
 	 */
 	tid_t			j_transaction_sequence;
 
+	/**
+	 * @j_fc_sequence:
+	 *
+	 * The sequence number of the most recently committed fast
+	 * commit. [j_state_lock]
+	 */
+	tid_t			j_fc_sequence;
+
 	/**
 	 * @j_commit_sequence:
 	 *
@@ -1205,6 +1224,24 @@ struct journal_s
 	 */
 	struct lockdep_map	j_trans_commit_map;
 #endif
+	/**
+	 * @j_fc_commit_callback:
+	 *
+	 * File-system specific function that performs the actual fast commit
+	 * operation. Returns 0 if the fast commit was successful; in that
+	 * case, JBD2 just increments journal->j_subtid and moves on. If it
+	 * returns < 0, JBD2 falls back to a full commit.
+	 */
+	int (*j_fc_commit_callback)(struct journal_s *journal, tid_t tid,
+				    tid_t subtid,
+				    struct transaction_run_stats_s *stats);
+	/**
+	 * @j_fc_cleanup_callback:
+	 *
+	 * Clean-up after fast commit or full commit. JBD2 calls this function
+	 * after every commit operation.
+	 */
+	void (*j_fc_cleanup_callback)(struct journal_s *journal);
 };
 
 #define jbd2_might_wait_for_commit(j) \
@@ -1323,7 +1360,8 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 void jbd2_update_log_tail(journal_t *journal, tid_t tid, unsigned long block);
 
 /* Commit management */
-extern void jbd2_journal_commit_transaction(journal_t *);
+extern void jbd2_journal_commit_transaction(journal_t *journal,
+					    bool *full_commit);
 
 /* Checkpoint list management */
 void __jbd2_journal_clean_checkpoint_list(journal_t *journal, bool destroy);
@@ -1532,8 +1570,10 @@ extern void	jbd2_clear_buffer_revoked_flags(journal_t *journal);
  */
 
 int jbd2_log_start_commit(journal_t *journal, tid_t tid);
+int jbd2_log_start_commit_fast(journal_t *journal, tid_t tid);
 int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
 int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
+int jbd2_journal_start_commit_fast(journal_t *journal, tid_t *tid);
 int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
 int jbd2_transaction_committed(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
diff --git a/include/trace/events/jbd2.h b/include/trace/events/jbd2.h
index 2310b259329f..af78bacdae83 100644
--- a/include/trace/events/jbd2.h
+++ b/include/trace/events/jbd2.h
@@ -233,9 +233,9 @@ TRACE_EVENT(jbd2_handle_stats,
 
 TRACE_EVENT(jbd2_run_stats,
 	TP_PROTO(dev_t dev, unsigned long tid,
-		 struct transaction_run_stats_s *stats),
+		 struct transaction_run_stats_s *stats, bool fc),
 
-	TP_ARGS(dev, tid, stats),
+	TP_ARGS(dev, tid, stats, fc),
 
 	TP_STRUCT__entry(
 		__field(		dev_t,	dev		)
@@ -249,6 +249,7 @@ TRACE_EVENT(jbd2_run_stats,
 		__field(		__u32,	handle_count	)
 		__field(		__u32,	blocks		)
 		__field(		__u32,	blocks_logged	)
+		__field(		 bool,	fc		)
 	),
 
 	TP_fast_assign(
@@ -263,11 +264,13 @@ TRACE_EVENT(jbd2_run_stats,
 		__entry->handle_count	= stats->rs_handle_count;
 		__entry->blocks		= stats->rs_blocks;
 		__entry->blocks_logged	= stats->rs_blocks_logged;
+		__entry->fc		= fc;
 	),
 
-	TP_printk("dev %d,%d tid %lu wait %u request_delay %u running %u "
+	TP_printk("%s commit, dev %d,%d tid %lu wait %u request_delay %u running %u "
 		  "locked %u flushing %u logging %u handle_count %u "
 		  "blocks %u blocks_logged %u",
+		  __entry->fc ? "fast" : "full",
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->tid,
 		  jiffies_to_msecs(__entry->wait),
 		  jiffies_to_msecs(__entry->request_delay),
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 04/13] jbd2: fast-commit commit path new APIs
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (2 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 03/13] jbd2: fast-commit commit path changes Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 17:20   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 05/13] jbd2: fast-commit recovery path changes Harshad Shirwadkar
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch adds new helper APIs that ext4 needs for fast commits.
Subsequent patches in this series use them to implement fast commits
in ext4. The following new APIs are added:

/*
 * Returns when either a full commit or a fast commit
 * completes
 */
int jbd2_fc_complete_commit(journal_t *journal, tid_t tid,
			                tid_t subtid)

/* Send all the data buffers related to an inode */
int jbd2_submit_inode_data(journal_t *journal,
			   struct jbd2_inode *jinode)

/* Map one fast commit buffer for use by the file system */
int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)

/* Wait on fast commit buffers to complete IO */
int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)

/*
 * Returns 1 if transaction identified by tid:subtid is already
 * committed.
 */
int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid)
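
As an illustration of the tid:subtid check described above, here is a
minimal user-space sketch. It is not the kernel code: struct
model_journal and all model_* names are hypothetical stand-ins, locking
is omitted, and the comparisons use the wraparound-safe signed
difference idiom (an assumption about how sequence numbers should be
compared once tid_t wraps).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* Wraparound-safe "x is later than y" on 32-bit sequence numbers. */
static bool model_tid_gt(tid_t x, tid_t y)
{
	return (int32_t)(x - y) > 0;
}

/* Wraparound-safe "x is later than or equal to y". */
static bool model_tid_geq(tid_t x, tid_t y)
{
	return (int32_t)(x - y) >= 0;
}

/* Hypothetical stand-in for the journal fields the check reads. */
struct model_journal {
	tid_t commit_sequence;	/* last fully committed tid */
	bool  has_running;	/* is there a running transaction? */
	tid_t running_tid;	/* tid of the running transaction */
	tid_t running_subtid;	/* subtid of the fast commit in progress */
};

/* Returns 1 if tid:subtid is already committed, 0 otherwise. */
static int model_commit_check(const struct model_journal *j,
			      tid_t tid, tid_t subtid)
{
	if (model_tid_geq(j->commit_sequence, tid))
		return 1;	/* covered by a full commit */
	if (!j->has_running)
		return 0;
	if (model_tid_gt(j->running_tid, tid))
		return 1;	/* journal has moved past this tid */
	if (model_tid_gt(j->running_subtid, subtid))
		return 1;	/* a later fast commit is in progress */
	return 0;
}
```

With commit_sequence = 9 and a running transaction at 10:3, the model
reports 9:1 and 10:2 as committed and 10:3 as still pending.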

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/jbd2/commit.c     |  32 +++++++++++++
 fs/jbd2/journal.c    | 110 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h |   8 ++++
 3 files changed, 150 insertions(+)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 7db3e2b6336d..e85f51e1cc70 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -202,6 +202,38 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
 	return ret;
 }
 
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+	struct address_space *mapping;
+	loff_t dirty_start;
+	loff_t dirty_end;
+	int ret;
+
+	if (!jinode)
+		return 0;
+
+	if (!(jinode->i_flags & JI_WRITE_DATA))
+		return 0;
+
+	dirty_start = jinode->i_dirty_start;
+	dirty_end = jinode->i_dirty_end;
+
+	mapping = jinode->i_vfs_inode->i_mapping;
+	jinode->i_flags |= JI_COMMIT_RUNNING;
+
+	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
+	ret = journal_submit_inode_data_buffers(mapping, dirty_start,
+						dirty_end);
+
+	jinode->i_flags &= ~JI_COMMIT_RUNNING;
+	/* Protect JI_COMMIT_RUNNING flag */
+	smp_mb();
+	wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);
+
+	return ret;
+}
+EXPORT_SYMBOL(jbd2_submit_inode_data);
+
 /*
  * Submit all the data buffers of inode associated with the transaction to
  * disk.
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 6853064605ff..14d549445418 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -781,6 +781,18 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return __jbd2_log_wait_commit(journal, tid, 0);
 }
 
+int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid)
+{
+	if (tid_geq(journal->j_commit_sequence, tid))
+		return 1;
+	if (!journal->j_running_transaction)
+		return 0;
+	if (tid_gt(journal->j_running_transaction->t_tid, tid))
+		return 1;
+	if (tid_gt(journal->j_running_transaction->t_subtid, subtid))
+		return 1;
+	return 0;
+}
 
 /* Return 1 when transaction with given tid has already committed. */
 int jbd2_transaction_committed(journal_t *journal, tid_t tid)
@@ -830,6 +842,33 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid)
 }
 EXPORT_SYMBOL(jbd2_complete_transaction);
 
+int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid)
+{
+	int	need_to_wait = 1;
+
+	read_lock(&journal->j_state_lock);
+	if (journal->j_running_transaction &&
+	    journal->j_running_transaction->t_tid == tid) {
+		/* Check if fast commit was already done */
+		if (tid_geq(journal->j_fc_sequence, subtid))
+			need_to_wait = 0;
+		if (journal->j_commit_request != tid) {
+			/* transaction not yet started, so request it */
+			read_unlock(&journal->j_state_lock);
+			jbd2_log_start_commit_fast(journal, tid);
+			goto wait_commit;
+		}
+	} else if (!(journal->j_committing_transaction &&
+		     journal->j_committing_transaction->t_tid == tid))
+		need_to_wait = 0;
+	read_unlock(&journal->j_state_lock);
+	if (!need_to_wait)
+		return 0;
+wait_commit:
+	return __jbd2_log_wait_commit(journal, tid, subtid);
+}
+EXPORT_SYMBOL(jbd2_fc_complete_commit);
+
 /*
  * Log buffer allocation routines:
  */
@@ -850,6 +889,77 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
 	return jbd2_journal_bmap(journal, blocknr, retp);
 }
 
+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
+{
+	unsigned long long pblock;
+	unsigned long blocknr;
+	int ret = 0;
+	struct buffer_head *bh;
+	int fc_off;
+	journal_header_t *jhdr;
+
+	write_lock(&journal->j_state_lock);
+
+	if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
+		fc_off = journal->j_fc_off;
+		blocknr = journal->j_first_fc + fc_off;
+		journal->j_fc_off++;
+	} else {
+		ret = -EINVAL;
+	}
+	write_unlock(&journal->j_state_lock);
+
+	if (ret)
+		return ret;
+
+	ret = jbd2_journal_bmap(journal, blocknr, &pblock);
+	if (ret)
+		return ret;
+
+	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+	if (!bh)
+		return -ENOMEM;
+
+	lock_buffer(bh);
+	jhdr = (journal_header_t *)bh->b_data;
+	jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+	jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
+	jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
+
+	set_buffer_uptodate(bh);
+	unlock_buffer(bh);
+	journal->j_fc_wbuf[fc_off] = bh;
+
+	*bh_out = bh;
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_map_fc_buf);
+
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
+{
+	struct buffer_head *bh;
+	int i, j_fc_off;
+
+	read_lock(&journal->j_state_lock);
+	j_fc_off = journal->j_fc_off;
+	read_unlock(&journal->j_state_lock);
+
+	/*
+	 * Wait in reverse order to minimize chances of us being woken up before
+	 * all IOs have completed
+	 */
+	for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
+		bh = journal->j_fc_wbuf[i];
+		wait_on_buffer(bh);
+		if (unlikely(!buffer_uptodate(bh)))
+			return -EIO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_wait_on_fc_bufs);
+
 /*
  * Conversion of logical to physical block numbers for the journal
  *
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 41315f648c0f..c6a2b82de4cf 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -124,6 +124,7 @@ typedef struct journal_s	journal_t;	/* Journal control structure */
 #define JBD2_SUPERBLOCK_V1	3
 #define JBD2_SUPERBLOCK_V2	4
 #define JBD2_REVOKE_BLOCK	5
+#define JBD2_FC_BLOCK		6
 
 /*
  * Standard header for all descriptor blocks:
@@ -1579,6 +1580,7 @@ int jbd2_transaction_committed(journal_t *journal, tid_t tid);
 int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
+int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid);
 
 void __jbd2_log_wait_for_space(journal_t *journal);
 extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
@@ -1729,6 +1731,12 @@ static inline tid_t  jbd2_get_latest_transaction(journal_t *journal)
 	return tid;
 }
 
+/* Fast commit related APIs */
+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out);
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks);
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid);
+
 #ifdef __KERNEL__
 
 #define buffer_trace_init(bh)	do {} while (0)
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 05/13] jbd2: fast-commit recovery path changes
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (3 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 04/13] jbd2: fast-commit commit path new APIs Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 17:30   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 06/13] ext4: add fields that are needed to track changed files Harshad Shirwadkar
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch adds fast-commit recovery path changes for JBD2. If we find
a valid fast commit block during recovery, we call a file system
specific routine to handle that block.

We also clear the fast commit flag in jbd2_mark_journal_empty(), which
is called after successful recovery as well as after successful
checkpointing. This keeps the JBD2 journal compatible with older
kernels when there are no fast commit blocks.
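
The replay scan described above can be sketched in user space roughly
as follows. This is a simplified model under stated assumptions: the
header is kept in host byte order, checksums are ignored, and struct
model_fc_block, model_replay_cb and model_fc_scan are hypothetical
names that do not exist in JBD2.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_JBD2_MAGIC 0xc03b3998U

/* Simplified block header: host byte order, no checksum (assumption). */
struct model_fc_block {
	uint32_t magic;
	uint32_t sequence;
};

/* Hypothetical replay hook: return 0 to continue, nonzero to abort. */
typedef int (*model_replay_cb)(const struct model_fc_block *blk,
			       size_t off, void *ctx);

/*
 * Walk the fast commit area in order. Stop quietly at the first block
 * that does not carry the magic number and the expected tid; stop with
 * an error if the callback fails. *replayed receives the number of
 * blocks that were accepted.
 */
static int model_fc_scan(const struct model_fc_block *area, size_t nblocks,
			 uint32_t expected_tid, model_replay_cb cb,
			 void *ctx, size_t *replayed)
{
	size_t off;
	int err = 0;

	*replayed = 0;
	for (off = 0; off < nblocks; off++) {
		if (area[off].magic != MODEL_JBD2_MAGIC ||
		    area[off].sequence != expected_tid)
			break;
		if (cb) {
			err = cb(&area[off], off, ctx);
			if (err)
				break;
		}
		(*replayed)++;
	}
	return err;
}

/* Example callback that only counts how often it was invoked. */
static int model_count_cb(const struct model_fc_block *blk, size_t off,
			  void *ctx)
{
	(void)blk;
	(void)off;
	(*(int *)ctx)++;
	return 0;
}
```

A block left over from an older transaction ends the scan, so stale
fast commit blocks later in the area are never handed to the file
system.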

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/jbd2/journal.c    | 12 +++++++++
 fs/jbd2/recovery.c   | 63 +++++++++++++++++++++++++++++++++++++++++---
 include/linux/jbd2.h | 13 +++++++++
 3 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 14d549445418..e0684212384d 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1635,6 +1635,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	bool had_fast_commit = false;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	lock_buffer(journal->j_sb_buffer);
@@ -1648,9 +1649,20 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 
 	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
 	sb->s_start    = cpu_to_be32(0);
+	if (jbd2_has_feature_fast_commit(journal)) {
+		/*
+		 * When journal is clean, no need to commit fast commit flag and
+		 * make file system incompatible with older kernels.
+		 */
+		jbd2_clear_feature_fast_commit(journal);
+		had_fast_commit = true;
+	}
 
 	jbd2_write_superblock(journal, write_op);
 
+	if (had_fast_commit)
+		jbd2_set_feature_fast_commit(journal);
+
 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
 	journal->j_flags |= JBD2_FLUSHED;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index a4967b27ffb6..c1f4c94ed375 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -35,7 +35,6 @@ struct recovery_info
 	int		nr_revoke_hits;
 };
 
-enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
 static int do_one_pass(journal_t *journal,
 				struct recovery_info *info, enum passtype pass);
 static int scan_revoke_records(journal_t *, struct buffer_head *,
@@ -225,8 +224,12 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
 /* Make sure we wrap around the log correctly! */
 #define wrap(journal, var)						\
 do {									\
-	if (var >= (journal)->j_last)					\
-		var -= ((journal)->j_last - (journal)->j_first);	\
+	unsigned long _wrap_last =					\
+		jbd2_has_feature_fast_commit(journal) ?			\
+			(journal)->j_last_fc : (journal)->j_last;	\
+									\
+	if (var >= _wrap_last)						\
+		var -= (_wrap_last - (journal)->j_first);		\
 } while (0)
 
 /**
@@ -413,6 +416,51 @@ static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
 		return tag->t_checksum == cpu_to_be16(csum32);
 }
 
+static int fc_do_one_pass(journal_t *journal,
+			  struct recovery_info *info, enum passtype pass)
+{
+	unsigned int expected_commit_id = info->end_transaction;
+	unsigned long next_fc_block;
+	struct buffer_head *bh;
+	unsigned int seq;
+	journal_header_t *jhdr;
+	int err = 0;
+
+	next_fc_block = journal->j_first_fc;
+
+	while (next_fc_block <= journal->j_last_fc) {
+		jbd_debug(3, "Fast commit replay: next block %lu\n",
+			  next_fc_block);
+		err = jread(&bh, journal, next_fc_block);
+		if (err)
+			break;
+
+		jhdr = (journal_header_t *)bh->b_data;
+		seq = be32_to_cpu(jhdr->h_sequence);
+		if (be32_to_cpu(jhdr->h_magic) != JBD2_MAGIC_NUMBER ||
+		    seq != expected_commit_id) {
+			break;
+		}
+		jbd_debug(3, "Processing fast commit blk with seq %u\n",
+			  seq);
+		if (journal->j_fc_replay_callback) {
+			err = journal->j_fc_replay_callback(
+						journal, bh, pass,
+						next_fc_block -
+						journal->j_first_fc);
+			if (err)
+				break;
+		}
+		next_fc_block++;
+	}
+
+	if (err)
+		jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
+
+	return err;
+}
+
+
 static int do_one_pass(journal_t *journal,
 			struct recovery_info *info, enum passtype pass)
 {
@@ -470,7 +518,7 @@ static int do_one_pass(journal_t *journal,
 				break;
 
 		jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
-			  next_commit_ID, next_log_block, journal->j_last);
+			  next_commit_ID, next_log_block, journal->j_last_fc);
 
 		/* Skip over each chunk of the transaction looking
 		 * either the next descriptor block or the final commit
@@ -768,6 +816,8 @@ static int do_one_pass(journal_t *journal,
 			if (err)
 				goto failed;
 			continue;
+		case JBD2_FC_BLOCK:
+			continue;
 
 		default:
 			jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
@@ -799,6 +849,11 @@ static int do_one_pass(journal_t *journal,
 				success = -EIO;
 		}
 	}
+
+
+	if (jbd2_has_feature_fast_commit(journal) && pass != PASS_REVOKE)
+		fc_do_one_pass(journal, info, pass);
+
 	if (block_error && success == 0)
 		success = -EIO;
 	return success;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index c6a2b82de4cf..312103fc9581 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -762,6 +762,8 @@ jbd2_time_diff(unsigned long start, unsigned long end)
 
 #define JBD2_NR_BATCH	64
 
+enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
+
 /**
  * struct journal_s - The journal_s type is the concrete type associated with
  *     journal_t.
@@ -1243,6 +1245,17 @@ struct journal_s
 	 * after every commit operation.
 	 */
 	void (*j_fc_cleanup_callback)(struct journal_s *journal);
+
+	/**
+	 * @j_fc_replay_callback:
+	 *
+	 * File-system specific function that performs replay of a fast
+	 * commit. JBD2 calls this function for each fast commit block found in
+	 * the journal.
+	 */
+	int (*j_fc_replay_callback)(struct journal_s *journal,
+				    struct buffer_head *bh,
+				    enum passtype pass, int off);
 };
 
 #define jbd2_might_wait_for_commit(j) \
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 06/13] ext4: add fields that are needed to track changed files
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (4 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 05/13] jbd2: fast-commit recovery path changes Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 18:26   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 07/13] ext4: track changed files for fast commit Harshad Shirwadkar
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

Ext4's fast commit feature tracks changed files and maintains them in
a queue. We also remember for each file the logical block range that
needs to be committed. This patch adds these fields to ext4_inode_info
and ext4_sb_info and also adds initialization calls.
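
The per-inode fields are only meaningful while their recorded tid
matches the running transaction. A minimal user-space sketch of that
validity rule is shown below; struct model_fc_info mirrors the field
names in this patch, but the model_* helpers are hypothetical and the
rwlock is left out.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t tid_t;

/* User-space mirror of the per-inode fast commit fields (no locking). */
struct model_fc_info {
	tid_t    fc_tid;
	uint32_t fc_lblk_start;
	uint32_t fc_lblk_end;
	uint32_t fc_parent_ino;
	bool     fc_eligible;
	bool     fc_new;
};

/* Clear every field, as done when an inode is initialized. */
static void model_fc_reset(struct model_fc_info *fc)
{
	fc->fc_tid = 0;
	fc->fc_lblk_start = 0;
	fc->fc_lblk_end = 0;
	fc->fc_parent_ino = 0;
	fc->fc_eligible = false;
	fc->fc_new = false;
}

/* Stamp the info with running_tid; stale contents are discarded. */
static void model_fc_touch(struct model_fc_info *fc, tid_t running_tid)
{
	if (fc->fc_tid == running_tid)
		return;
	model_fc_reset(fc);
	fc->fc_eligible = true;
	fc->fc_tid = running_tid;
}

/* Read fc_new, treating info from an older transaction as "not new". */
static bool model_fc_is_new(const struct model_fc_info *fc,
			    tid_t running_tid)
{
	return fc->fc_tid == running_tid && fc->fc_new;
}
```

Stamping with a tid means nothing has to walk all inodes to clear
their state after a commit; old state simply stops matching.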

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h      | 60 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.c | 20 +++++++++++++++
 fs/ext4/ext4_jbd2.h |  2 ++
 fs/ext4/ialloc.c    |  1 +
 fs/ext4/inode.c     |  1 +
 fs/ext4/super.c     |  7 ++++++
 6 files changed, 91 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index becbda38b7db..c36ec23046f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -921,6 +921,48 @@ enum {
 	I_DATA_SEM_QUOTA,
 };
 
+/*
+ * Ext4 fast commit inode specific information
+ */
+struct ext4_fast_commit_inode_info {
+	/*
+	 * TID of when this struct was last updated. If fc_tid !=
+	 * running transaction tid, then none of the other fields in this struct
+	 * are valid. Don't directly modify fields in this struct. Use wrappers
+	 * provided in ext4_jbd2.c.
+	 */
+	tid_t fc_tid;
+	/*
+	 * Start of logical block range that needs to be committed in this fast
+	 * commit
+	 */
+	ext4_lblk_t fc_lblk_start;
+
+	/*
+	 * End of logical block range that needs to be committed in this fast
+	 * commit
+	 */
+	ext4_lblk_t fc_lblk_end;
+
+	/*
+	 * Inode number of the directory that contains this inode. This field
+	 * is only valid if fc_new is set.
+	 */
+	u32 fc_parent_ino;
+
+	/*
+	 * Flag indicating whether this inode is eligible for fast commits or
+	 * not.
+	 */
+	bool fc_eligible;
+
+	/*
+	 * Flag indicating whether this inode is newly created during this
+	 * tid:subtid.
+	 */
+	bool fc_new;
+	rwlock_t fc_lock;
+};
 
 /*
  * fourth extended file system inode data in memory
@@ -955,6 +997,12 @@ struct ext4_inode_info {
 
 	struct list_head i_orphan;	/* unlinked but open inodes */
 
+	struct list_head i_fc_list;	/*
+					 * inodes that need fast commit
+					 * protected by sbi->s_fc_lock.
+					 */
+	struct ext4_fast_commit_inode_info i_fc;
+
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
 	 * in memory.  During truncate, i_size is set to the new size by
@@ -1058,7 +1106,9 @@ struct ext4_inode_info {
 	 * fsync and fdatasync, respectively.
 	 */
 	tid_t i_sync_tid;
+	tid_t i_sync_subtid;
 	tid_t i_datasync_tid;
+	tid_t i_datasync_subtid;
 
 #ifdef CONFIG_QUOTA
 	struct dquot *i_dquot[MAXQUOTAS];
@@ -1529,6 +1579,16 @@ struct ext4_sb_info {
 	/* Barrier between changing inodes' journal flags and writepages ops. */
 	struct percpu_rw_semaphore s_journal_flag_rwsem;
 	struct dax_device *s_daxdev;
+
+	/* Ext4 fast commit stuff */
+	bool s_fc_replay;		/* Fast commit replay in progress */
+	struct list_head s_fc_q;	/* Inodes that need fast commit. */
+	__u32 s_fc_q_cnt;		/* Number of inodes in the fc queue */
+	bool s_fc_eligible;		/*
+					 * Are changes after the last commit
+					 * eligible for fast commit?
+					 */
+	spinlock_t s_fc_lock;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 7c70b08d104c..9066bcfbee29 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -330,3 +330,23 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
 		mark_buffer_dirty(bh);
 	return err;
 }
+
+static inline
+void ext4_reset_inode_fc_info(struct ext4_fast_commit_inode_info *i_fc)
+{
+	i_fc->fc_tid = 0;
+	i_fc->fc_lblk_start = 0;
+	i_fc->fc_lblk_end = 0;
+	i_fc->fc_parent_ino = 0;
+	i_fc->fc_eligible = false;
+	i_fc->fc_new = false;
+}
+
+void ext4_init_inode_fc_info(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ext4_reset_inode_fc_info(&ei->i_fc);
+	INIT_LIST_HEAD(&ei->i_fc_list);
+	rwlock_init(&ei->i_fc.fc_lock);
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index ef8fcf7d0d3b..2305c1acd415 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -459,4 +459,6 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+void ext4_init_inode_fc_info(struct inode *inode);
+
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 764ff4c56233..ff30f3015551 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1131,6 +1131,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 
 	ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
 	ext4_set_inode_state(inode, EXT4_STATE_NEW);
+	ext4_init_inode_fc_info(inode);
 
 	ei->i_extra_isize = sbi->s_want_extra_isize;
 	ei->i_inline_off = 0;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..f230a888eddd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4996,6 +4996,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	for (block = 0; block < EXT4_N_BLOCKS; block++)
 		ei->i_data[block] = raw_inode->i_block[block];
 	INIT_LIST_HEAD(&ei->i_orphan);
+	ext4_init_inode_fc_info(&ei->vfs_inode);
 
 	/*
 	 * Set transaction id's of transactions that have to be committed
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7725eb2105f4..c90337fc98c1 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1100,6 +1100,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_unwritten, 0);
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+	ext4_init_inode_fc_info(&ei->vfs_inode);
 	return &ei->vfs_inode;
 }
 
@@ -1139,6 +1140,7 @@ static void init_once(void *foo)
 	init_rwsem(&ei->i_data_sem);
 	init_rwsem(&ei->i_mmap_sem);
 	inode_init_once(&ei->vfs_inode);
+	ext4_init_inode_fc_info(&ei->vfs_inode);
 }
 
 static int __init init_inodecache(void)
@@ -4301,6 +4303,11 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
 	mutex_init(&sbi->s_orphan_lock);
 
+	INIT_LIST_HEAD(&sbi->s_fc_q);
+	sbi->s_fc_q_cnt = 0;
+	sbi->s_fc_eligible = true;
+	spin_lock_init(&sbi->s_fc_lock);
+
 	sb->s_root = NULL;
 
 	needs_recovery = (es->s_last_orphan != 0 ||
-- 
2.23.0.444.g18eeb5a265-goog



* [PATCH v3 07/13] ext4: track changed files for fast commit
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (5 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 06/13] ext4: add fields that are needed to track changed files Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 20:26   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 08/13] ext4: fast-commit commit range tracking Harshad Shirwadkar
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

For fast commit, we need to remember all the files that have changed
since the last fast commit or full commit. For changes that are fast
commit incompatible, we mark the file system as fast commit ineligible.
This patch adds code to either remember files that have changed or to
mark ext4 as fast commit ineligible. We inspect every
ext4_mark_inode_dirty() call and decide whether that particular file
change is fast commit compatible or not.
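
The tracking policy described above can be modeled in user space as
follows. This is only a sketch: a fixed-size array stands in for the
s_fc_q list, locking is omitted, all model_* names are hypothetical,
and the "queue full" fallback is a policy of the model, not something
this patch does.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_INODES 8

struct model_inode {
	unsigned long ino;
	bool queued;		/* stands in for !list_empty(&i_fc_list) */
};

struct model_sb {
	bool fc_eligible;	/* models s_fc_eligible */
	size_t q_cnt;		/* models s_fc_q_cnt */
	struct model_inode *q[MODEL_MAX_INODES];	/* models s_fc_q */
};

static void model_sb_init(struct model_sb *sb)
{
	sb->fc_eligible = true;
	sb->q_cnt = 0;
}

/* Remember a changed inode once, while a fast commit is still possible. */
static void model_fc_enqueue(struct model_sb *sb, struct model_inode *inode)
{
	if (!sb->fc_eligible || inode->queued)
		return;
	if (sb->q_cnt == MODEL_MAX_INODES) {
		/* Model-only policy: a full queue forces a full commit. */
		sb->fc_eligible = false;
		return;
	}
	sb->q[sb->q_cnt++] = inode;
	inode->queued = true;
}

/* One incompatible change makes the whole file system ineligible. */
static void model_fc_mark_ineligible(struct model_sb *sb)
{
	sb->fc_eligible = false;
}

/* True if the next commit may be a fast commit. */
static bool model_can_fast_commit(const struct model_sb *sb)
{
	return sb->fc_eligible;
}
```

Once the file system is marked ineligible, later changes are no longer
queued, because the next commit has to be a full commit anyway.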

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/acl.c       |  1 +
 fs/ext4/ext4_jbd2.c | 96 +++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.h | 44 +++++++++++++++++++++
 fs/ext4/extents.c   | 16 +++++++-
 fs/ext4/ialloc.c    |  8 ++++
 fs/ext4/inline.c    | 10 +++++
 fs/ext4/inode.c     | 24 +++++++++++-
 fs/ext4/ioctl.c     |  3 ++
 fs/ext4/migrate.c   |  1 +
 fs/ext4/namei.c     | 12 +++++-
 fs/ext4/super.c     |  5 +++
 fs/ext4/xattr.c     |  1 +
 12 files changed, 216 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 8c7bbf3e566d..e84be9c315db 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -257,6 +257,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 		inode->i_mode = mode;
 		inode->i_ctime = current_time(inode);
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 	}
 out_stop:
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 9066bcfbee29..e70ad7a8e46e 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -331,6 +331,13 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
 	return err;
 }
 
+static inline tid_t get_running_txn_tid(struct super_block *sb)
+{
+	if (EXT4_SB(sb)->s_journal)
+		return EXT4_SB(sb)->s_journal->j_commit_sequence + 1;
+	return 0;
+}
+
 static inline
 void ext4_reset_inode_fc_info(struct ext4_fast_commit_inode_info *i_fc)
 {
@@ -350,3 +357,92 @@ void ext4_init_inode_fc_info(struct inode *inode)
 	INIT_LIST_HEAD(&ei->i_fc_list);
 	rwlock_init(&ei->i_fc.fc_lock);
 }
+
+void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+
+	if (!ext4_should_fast_commit(inode->i_sb))
+		return;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (!sbi->s_fc_eligible) {
+		spin_unlock(&sbi->s_fc_lock);
+		return;
+	}
+	if (list_empty(&EXT4_I(inode)->i_fc_list)) {
+		list_add(&EXT4_I(inode)->i_fc_list, &sbi->s_fc_q);
+		sbi->s_fc_q_cnt++;
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	write_lock(&ei->i_fc.fc_lock);
+	if (ei->i_fc.fc_tid == running_txn_tid) {
+		write_unlock(&ei->i_fc.fc_lock);
+		return;
+	}
+
+	ext4_reset_inode_fc_info(&ei->i_fc);
+	ei->i_fc.fc_lblk_start = i_size_read(inode);
+	ei->i_fc.fc_lblk_end = i_size_read(inode);
+	ei->i_fc.fc_eligible = true;
+	ei->i_fc.fc_tid = running_txn_tid;
+	write_unlock(&ei->i_fc.fc_lock);
+}
+
+void ext4_fc_del(struct inode *inode)
+{
+	if (!ext4_should_fast_commit(inode->i_sb))
+		return;
+
+	if (list_empty(&EXT4_I(inode)->i_fc_list))
+		return;
+
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	list_del_init(&EXT4_I(inode)->i_fc_list);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+void ext4_fc_mark_new(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+
+	write_lock(&ei->i_fc.fc_lock);
+	if (ei->i_fc.fc_tid != running_txn_tid) {
+		ext4_reset_inode_fc_info(&ei->i_fc);
+		ei->i_fc.fc_tid = running_txn_tid;
+		ei->i_fc.fc_eligible = true;
+	}
+	ei->i_fc.fc_new = true;
+	write_unlock(&ei->i_fc.fc_lock);
+}
+
+bool ext4_is_inode_fc_ineligible(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+	bool ret = false;
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (running_txn_tid == ei->i_fc.fc_tid)
+		ret = !ei->i_fc.fc_eligible;
+	read_unlock(&ei->i_fc.fc_lock);
+	return ret;
+}
+
+bool ext4_is_inode_fc_new(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+	bool ret = false;
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (running_txn_tid == ei->i_fc.fc_tid)
+		ret = ei->i_fc.fc_new;
+	read_unlock(&ei->i_fc.fc_lock);
+
+	return ret;
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2305c1acd415..65f20fbfb002 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -378,6 +378,17 @@ static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
 	return 0;
 }
 
+static inline int ext4_should_fast_commit(struct super_block *sb)
+{
+	if (!ext4_has_feature_fast_commit(sb))
+		return 0;
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
+		return 0;
+	if (test_opt(sb, QUOTA))
+		return 0;
+	return 1;
+}
+
 static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 						 struct inode *inode,
 						 int datasync)
@@ -460,5 +471,38 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 }
 
 void ext4_init_inode_fc_info(struct inode *inode);
+extern void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode);
+extern void ext4_fc_del(struct inode *inode);
+
+static inline void
+ext4_fc_mark_sb_ineligible(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	spin_lock(&sbi->s_fc_lock);
+	sbi->s_fc_eligible = false;
+	spin_unlock(&sbi->s_fc_lock);
+}
+
+
+static inline void
+ext4_fc_mark_ineligible(struct inode *inode)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	write_lock(&ei->i_fc.fc_lock);
+	if (sbi->s_journal)
+		ei->i_fc.fc_tid = sbi->s_journal->j_commit_sequence + 1;
+	ei->i_fc.fc_eligible = false;
+	write_unlock(&ei->i_fc.fc_lock);
+	spin_lock(&sbi->s_fc_lock);
+	sbi->s_fc_eligible = false;
+	spin_unlock(&sbi->s_fc_lock);
+}
+
 
+void ext4_fc_mark_new(struct inode *inode);
+bool ext4_is_inode_fc_ineligible(struct inode *inode);
+bool ext4_is_inode_fc_new(struct inode *inode);
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..b30f6175eb71 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -163,6 +163,7 @@ int __ext4_ext_dirty(const char *where, unsigned int line, handle_t *handle,
 	} else {
 		/* path points to leaf/index in inode body */
 		err = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 	}
 	return err;
 }
@@ -3714,6 +3715,8 @@ static int ext4_ext_convert_to_initialized(handle_t *handle,
 		err = ext4_zeroout_es(inode, &zero_ex1);
 		if (!err)
 			err = ext4_zeroout_es(inode, &zero_ex2);
+	} else {
+		ext4_fc_mark_ineligible(inode);
 	}
 	return err ? err : allocated;
 }
@@ -3856,7 +3859,7 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
 			      struct ext4_ext_path *path,
 			      unsigned int len)
 {
-	int i, depth;
+	int i, ret, depth;
 	struct ext4_extent_header *eh;
 	struct ext4_extent *last_ex;
 
@@ -3898,7 +3901,10 @@ static int check_eofblocks_fl(handle_t *handle, struct inode *inode,
 			return 0;
 out:
 	ext4_clear_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
-	return ext4_mark_inode_dirty(handle, inode);
+	ret = ext4_mark_inode_dirty(handle, inode);
+	ext4_fc_enqueue_inode(handle, inode);
+
+	return ret;
 }
 
 static int
@@ -4607,6 +4613,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 				   inode->i_ino, map.m_lblk,
 				   map.m_len, ret);
 			ext4_mark_inode_dirty(handle, inode);
+			ext4_fc_enqueue_inode(handle, inode);
 			ret2 = ext4_journal_stop(handle);
 			break;
 		}
@@ -4624,6 +4631,7 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
 				ext4_set_inode_flag(inode,
 						    EXT4_INODE_EOFBLOCKS);
 		}
+		ext4_fc_enqueue_inode(handle, inode);
 		ext4_mark_inode_dirty(handle, inode);
 		ext4_update_inode_fsync_trans(handle, inode, 1);
 		ret2 = ext4_journal_stop(handle);
@@ -4786,6 +4794,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 			ext4_set_inode_flag(inode, EXT4_INODE_EOFBLOCKS);
 	}
 	ext4_mark_inode_dirty(handle, inode);
+	ext4_fc_enqueue_inode(handle, inode);
 
 	/* Zero out partial block at the edges of the range */
 	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
@@ -4957,6 +4966,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
 				     "ext4_ext_map_blocks returned %d",
 				     inode->i_ino, map.m_lblk,
 				     map.m_len, ret);
+		ext4_fc_mark_ineligible(inode);
 		ext4_mark_inode_dirty(handle, inode);
 		if (credits)
 			ret2 = ext4_journal_stop(handle);
@@ -5485,6 +5495,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 	inode->i_mtime = inode->i_ctime = current_time(inode);
+	ext4_fc_mark_ineligible(inode);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 
@@ -5599,6 +5610,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 	inode->i_size += len;
 	EXT4_I(inode)->i_disksize += len;
 	inode->i_mtime = inode->i_ctime = current_time(inode);
+	ext4_fc_mark_ineligible(inode);
 	ret = ext4_mark_inode_dirty(handle, inode);
 	if (ret)
 		goto out_stop;
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ff30f3015551..47d04a33a3ca 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1133,6 +1133,14 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	ext4_set_inode_state(inode, EXT4_STATE_NEW);
 	ext4_init_inode_fc_info(inode);
 
+	if (S_ISDIR(mode) || ext4_is_inode_fc_ineligible(dir) ||
+	    ext4_is_inode_fc_new(dir)) {
+		ext4_fc_mark_ineligible(inode);
+	} else {
+		ext4_fc_mark_new(inode);
+		ei->i_fc.fc_parent_ino = dir->i_ino;
+	}
+
 	ei->i_extra_isize = sbi->s_want_extra_isize;
 	ei->i_inline_off = 0;
 	if (ext4_has_feature_inline_data(sb))
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 88cdf3c90bd1..fbd561cba098 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -435,6 +435,8 @@ static int ext4_destroy_inline_data_nolock(handle_t *handle,
 	if (error)
 		goto out;
 
+	ext4_fc_mark_ineligible(inode);
+
 	memset((void *)ext4_raw_inode(&is.iloc)->i_block,
 		0, EXT4_MIN_INLINE_DATA_SIZE);
 	memset(ei->i_data, 0, EXT4_MIN_INLINE_DATA_SIZE);
@@ -759,6 +761,7 @@ int ext4_write_inline_data_end(struct inode *inode, loff_t pos, unsigned len,
 
 	ext4_write_unlock_xattr(inode, &no_expand);
 	brelse(iloc.bh);
+	ext4_fc_enqueue_inode(ext4_journal_current_handle(), inode);
 	mark_inode_dirty(inode);
 out:
 	return copied;
@@ -974,6 +977,7 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
 	 * ordering of page lock and transaction start for journaling
 	 * filesystems.
 	 */
+	ext4_fc_enqueue_inode(ext4_journal_current_handle(), inode);
 	mark_inode_dirty(inode);
 
 	return copied;
@@ -1165,6 +1169,7 @@ static int ext4_finish_convert_inline_dir(handle_t *handle,
 	if (err)
 		return err;
 	set_buffer_verified(dir_block);
+	ext4_fc_mark_ineligible(inode);
 	return ext4_mark_inode_dirty(handle, inode);
 }
 
@@ -1216,6 +1221,8 @@ static int ext4_convert_inline_data_nolock(handle_t *handle,
 		goto out_restore;
 	}
 
+	ext4_fc_mark_ineligible(inode);
+
 	data_bh = sb_getblk(inode->i_sb, map.m_pblk);
 	if (!data_bh) {
 		error = -ENOMEM;
@@ -1709,6 +1716,8 @@ int ext4_delete_inline_entry(handle_t *handle,
 	if (err)
 		goto out;
 
+	ext4_fc_enqueue_inode(handle, dir);
+
 	ext4_show_inline_dir(dir, iloc.bh, inline_start, inline_size);
 out:
 	ext4_write_unlock_xattr(dir, &no_expand);
@@ -1986,6 +1995,7 @@ int ext4_inline_data_truncate(struct inode *inode, int *has_inline)
 
 	if (err == 0) {
 		inode->i_mtime = inode->i_ctime = current_time(inode);
+		ext4_fc_enqueue_inode(handle, inode);
 		err = ext4_mark_inode_dirty(handle, inode);
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f230a888eddd..6d2efbd9aba9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -262,6 +262,7 @@ void ext4_evict_inode(struct inode *inode)
 		 * cleaned up.
 		 */
 		ext4_orphan_del(NULL, inode);
+		ext4_fc_del(inode);
 		sb_end_intwrite(inode->i_sb);
 		goto no_delete;
 	}
@@ -279,6 +280,8 @@ void ext4_evict_inode(struct inode *inode)
 	if (ext4_inode_is_fast_symlink(inode))
 		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
 	inode->i_size = 0;
+	ext4_fc_del(inode);
+	ext4_fc_mark_ineligible(inode);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (err) {
 		ext4_warning(inode->i_sb,
@@ -303,6 +306,7 @@ void ext4_evict_inode(struct inode *inode)
 stop_handle:
 		ext4_journal_stop(handle);
 		ext4_orphan_del(NULL, inode);
+		ext4_fc_del(inode);
 		sb_end_intwrite(inode->i_sb);
 		ext4_xattr_inode_array_free(ea_inode_array);
 		goto no_delete;
@@ -326,6 +330,8 @@ void ext4_evict_inode(struct inode *inode)
 	 * having errors), but we can't free the inode if the mark_dirty
 	 * fails.
 	 */
+	ext4_fc_del(inode);
+	ext4_fc_mark_ineligible(inode);
 	if (ext4_mark_inode_dirty(handle, inode))
 		/* If that failed, just do the required in-core inode clear. */
 		ext4_clear_inode(inode);
@@ -1436,8 +1442,10 @@ static int ext4_write_end(struct file *file,
 	 * ordering of page lock and transaction start for journaling
 	 * filesystems.
 	 */
-	if (i_size_changed || inline_data)
+	if (i_size_changed || inline_data) {
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
+	}
 
 	if (pos + len > inode->i_size && ext4_can_truncate(inode))
 		/* if we have allocated more blocks and copied
@@ -1550,6 +1558,7 @@ static int ext4_journalled_write_end(struct file *file,
 		pagecache_isize_extended(inode, old_size, pos);
 
 	if (size_changed || inline_data) {
+		ext4_fc_enqueue_inode(handle, inode);
 		ret2 = ext4_mark_inode_dirty(handle, inode);
 		if (!ret)
 			ret = ret2;
@@ -2077,6 +2086,7 @@ static int __ext4_journalled_writepage(struct page *page,
 
 	if (inline_data) {
 		ret = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 	} else {
 		ret = ext4_walk_page_buffers(handle, page_bufs, 0, len, NULL,
 					     do_journal_get_write_access);
@@ -2604,6 +2614,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 			EXT4_I(inode)->i_disksize = disksize;
 		up_write(&EXT4_I(inode)->i_data_sem);
 		err2 = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 		if (err2)
 			ext4_error(inode->i_sb,
 				   "Failed to mark inode %lu dirty",
@@ -3205,6 +3216,7 @@ static int ext4_da_write_end(struct file *file,
 			 * bu greater than i_disksize.(hint delalloc)
 			 */
 			ext4_mark_inode_dirty(handle, inode);
+			ext4_fc_enqueue_inode(handle, inode);
 		}
 	}
 
@@ -3614,8 +3626,12 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 		ret = PTR_ERR(handle);
 		goto orphan_del;
 	}
-	if (ext4_update_inode_size(inode, offset + written))
+
+	if (ext4_update_inode_size(inode, offset + written)) {
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
+	}
+
 	/*
 	 * We may need to truncate allocated but not written blocks beyond EOF.
 	 */
@@ -3851,6 +3867,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 				 * ignore it.
 				 */
 				ext4_mark_inode_dirty(handle, inode);
+				ext4_fc_enqueue_inode(handle, inode);
 			}
 		}
 		err = ext4_journal_stop(handle);
@@ -4372,6 +4389,8 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 		goto out_dio;
 	}
 
+	ext4_fc_mark_ineligible(inode);
+
 	ret = ext4_zero_partial_blocks(handle, inode, offset,
 				       length);
 	if (ret)
@@ -4525,6 +4544,7 @@ int ext4_truncate(struct inode *inode)
 	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
 		ext4_block_truncate_page(handle, mapping, inode->i_size);
 
+	ext4_fc_mark_ineligible(inode);
 	/*
 	 * We add the inode to the orphan list, so that if this
 	 * truncate spans multiple transactions, and we crash, we will
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 442f7ef873fc..a8e23acb5c03 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -987,6 +987,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		err = mnt_want_write_file(filp);
 		if (err)
 			return err;
+		ext4_fc_mark_sb_ineligible(sb);
 		err = swap_inode_boot_loader(sb, inode);
 		mnt_drop_write_file(filp);
 		return err;
@@ -997,6 +998,8 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		int err = 0, err2 = 0;
 		ext4_group_t o_group = EXT4_SB(sb)->s_groups_count;
 
+		ext4_fc_mark_sb_ineligible(sb);
+
 		if (copy_from_user(&n_blocks_count, (__u64 __user *)arg,
 				   sizeof(__u64))) {
 			return -EFAULT;
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index b1e4d359f73b..b995690d73ce 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -513,6 +513,7 @@ int ext4_ext_migrate(struct inode *inode)
 		 * work to orphan_list_cleanup()
 		 */
 		ext4_orphan_del(NULL, tmp_inode);
+		ext4_fc_del(inode);
 		retval = PTR_ERR(handle);
 		goto out;
 	}
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 129029534075..8b73c5a38d49 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2140,8 +2140,10 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
 	 * out all the changes we did so far. Otherwise we can end up
 	 * with corrupted filesystem.
 	 */
-	if (retval)
+	if (retval) {
 		ext4_mark_inode_dirty(handle, dir);
+		ext4_fc_mark_ineligible(dir);
+	}
 	dx_release(frames);
 	brelse(bh2);
 	return retval;
@@ -2661,6 +2663,7 @@ static int ext4_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		err = ext4_orphan_add(handle, inode);
 		if (err)
 			goto err_unlock_inode;
+		ext4_fc_enqueue_inode(handle, inode);
 		mark_inode_dirty(inode);
 		unlock_new_inode(inode);
 	}
@@ -2773,6 +2776,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 	err = ext4_init_new_dir(handle, dir, inode);
 	if (err)
 		goto out_clear_inode;
+	ext4_fc_mark_ineligible(inode);
 	err = ext4_mark_inode_dirty(handle, inode);
 	if (!err)
 		err = ext4_add_entry(handle, dentry, inode);
@@ -3114,6 +3118,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 	inode->i_size = 0;
 	ext4_orphan_add(handle, inode);
 	inode->i_ctime = dir->i_ctime = dir->i_mtime = current_time(inode);
+	ext4_fc_mark_ineligible(inode);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_dec_count(handle, dir);
 	ext4_update_dx_flag(dir);
@@ -3192,6 +3197,7 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 		goto end_unlink;
 	dir->i_ctime = dir->i_mtime = current_time(dir);
 	ext4_update_dx_flag(dir);
+	ext4_fc_mark_ineligible(dir);
 	ext4_mark_inode_dirty(handle, dir);
 	drop_nlink(inode);
 	if (!inode->i_nlink)
@@ -3387,6 +3393,7 @@ static int ext4_link(struct dentry *old_dentry,
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
+		ext4_fc_mark_ineligible(inode);
 		ext4_mark_inode_dirty(handle, inode);
 		/* this can happen only for tmpfile being
 		 * linked the first time
@@ -3991,6 +3998,9 @@ static int ext4_rename2(struct inode *old_dir, struct dentry *old_dentry,
 	if (err)
 		return err;
 
+	ext4_fc_mark_ineligible(old_dir);
+	ext4_fc_mark_ineligible(new_dir);
+
 	if (flags & RENAME_EXCHANGE) {
 		return ext4_cross_rename(old_dir, old_dentry,
 					 new_dir, new_dentry);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c90337fc98c1..3e9570ea9748 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1181,6 +1181,7 @@ void ext4_clear_inode(struct inode *inode)
 		EXT4_I(inode)->jinode = NULL;
 	}
 	fscrypt_put_encryption_info(inode);
+	ext4_fc_del(inode);
 }
 
 static struct inode *ext4_nfs_get_inode(struct super_block *sb,
@@ -1325,6 +1326,7 @@ static int ext4_set_context(struct inode *inode, const void *ctx, size_t len,
 		 * S_DAX may be disabled
 		 */
 		ext4_set_inode_flags(inode);
+		ext4_fc_mark_ineligible(inode);
 		res = ext4_mark_inode_dirty(handle, inode);
 		if (res)
 			EXT4_ERROR_INODE(inode, "Failed to mark inode dirty");
@@ -5797,6 +5799,7 @@ static int ext4_quota_on(struct super_block *sb, int type, int format_id,
 		EXT4_I(inode)->i_flags |= EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL;
 		inode_set_flags(inode, S_NOATIME | S_IMMUTABLE,
 				S_NOATIME | S_IMMUTABLE);
+		ext4_fc_mark_ineligible(inode);
 		ext4_mark_inode_dirty(handle, inode);
 		ext4_journal_stop(handle);
 	unlock_inode:
@@ -5904,6 +5907,7 @@ static int ext4_quota_off(struct super_block *sb, int type)
 	EXT4_I(inode)->i_flags &= ~(EXT4_NOATIME_FL | EXT4_IMMUTABLE_FL);
 	inode_set_flags(inode, 0, S_NOATIME | S_IMMUTABLE);
 	inode->i_mtime = inode->i_ctime = current_time(inode);
+	ext4_fc_mark_ineligible(inode);
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 out_unlock:
@@ -6010,6 +6014,7 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
 	if (inode->i_size < off + len) {
 		i_size_write(inode, off + len);
 		EXT4_I(inode)->i_disksize = inode->i_size;
+		ext4_fc_mark_ineligible(inode);
 		ext4_mark_inode_dirty(handle, inode);
 	}
 	return len;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 491f9ee4040e..19bc4046658c 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1406,6 +1406,7 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
 	inode_unlock(ea_inode);
 
 	ext4_mark_inode_dirty(handle, ea_inode);
+	ext4_fc_enqueue_inode(handle, ea_inode);
 
 out:
 	brelse(bh);
-- 
2.23.0.444.g18eeb5a265-goog


* [PATCH v3 08/13] ext4: fast-commit commit range tracking
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (6 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 07/13] ext4: track changed files for fast commit Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 21:36   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

With this patch, we track the logical range of file offsets that needs
to be committed using fast commit. This allows us to find the file
extents that need to be committed at commit time.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
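A minimal user-space sketch of the range tracking described above, for
illustration only. The struct and function names (fc_range,
fc_range_update) are hypothetical stand-ins for the per-inode i_fc state
and are not part of this patch; locking and the eligibility checks are
left out. The idea shown: within one transaction the tracked range only
grows, and a range left over from an older transaction is discarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-inode fast-commit state (i_fc). */
struct fc_range {
	uint32_t tid;	/* transaction the range belongs to */
	uint32_t start;	/* lowest offset touched in that transaction */
	uint32_t end;	/* highest offset touched in that transaction */
	bool eligible;
};

/*
 * Widen the tracked range so that it covers [start, end]. If the stored
 * range belongs to an older transaction, drop it and start over.
 */
static void fc_range_update(struct fc_range *r, uint32_t running_tid,
			    uint32_t start, uint32_t end)
{
	assert(start <= end);

	if (r->tid == running_tid) {
		/* Same transaction: take the union of old and new range. */
		if (start < r->start)
			r->start = start;
		if (end > r->end)
			r->end = end;
		return;
	}

	/* New transaction: the old range is stale, so reset it. */
	r->tid = running_tid;
	r->start = start;
	r->end = end;
	r->eligible = true;
}
```

Two updates in the same transaction, for example [100, 200] followed by
[50, 150], leave the tracked range at [50, 200]. An update under a newer
transaction ID replaces the range outright.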
 fs/ext4/ext4_jbd2.c | 34 ++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.h |  2 ++
 fs/ext4/inline.c    |  4 +++-
 fs/ext4/inode.c     | 17 ++++++++++++++++-
 4 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index e70ad7a8e46e..0bb8de2139a5 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -405,6 +405,40 @@ void ext4_fc_del(struct inode *inode)
 	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
 }
 
+void ext4_fc_update_commit_range(struct inode *inode, ext4_lblk_t start,
+				 ext4_lblk_t end)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+
+	if (!ext4_should_fast_commit(inode->i_sb))
+		return;
+
+	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb))
+		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+
+	if (!EXT4_SB(inode->i_sb)->s_fc_eligible)
+		return;
+
+	write_lock(&ei->i_fc.fc_lock);
+	if (ei->i_fc.fc_tid == running_txn_tid) {
+		ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_start < start ?
+					 ei->i_fc.fc_lblk_start : start;
+		ei->i_fc.fc_lblk_end = ei->i_fc.fc_lblk_end > end ?
+				     ei->i_fc.fc_lblk_end : end;
+		write_unlock(&ei->i_fc.fc_lock);
+		return;
+	}
+
+	ext4_reset_inode_fc_info(&ei->i_fc);
+	ei->i_fc.fc_eligible = true;
+	ei->i_fc.fc_lblk_start = start;
+	ei->i_fc.fc_lblk_end = end;
+	ei->i_fc.fc_tid = running_txn_tid;
+	write_unlock(&ei->i_fc.fc_lock);
+
+}
+
 void ext4_fc_mark_new(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 65f20fbfb002..2cb7e7e1f025 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -501,6 +501,8 @@ ext4_fc_mark_ineligible(struct inode *inode)
 	spin_unlock(&sbi->s_fc_lock);
 }
 
+void ext4_fc_update_commit_range(struct inode *inode, ext4_lblk_t start,
+				 ext4_lblk_t end);
 
 void ext4_fc_mark_new(struct inode *inode);
 bool ext4_is_inode_fc_ineligible(struct inode *inode);
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index fbd561cba098..66b2c0e3f7e4 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -966,8 +966,10 @@ int ext4_da_write_inline_data_end(struct inode *inode, loff_t pos,
 	 * But it's important to update i_size while still holding page lock:
 	 * page writeout could otherwise come in and zero beyond i_size.
 	 */
-	if (pos+copied > inode->i_size)
+	if (pos+copied > inode->i_size) {
+		ext4_fc_update_commit_range(inode, inode->i_size, pos + copied);
 		i_size_write(inode, pos+copied);
+	}
 	unlock_page(page);
 	put_page(page);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6d2efbd9aba9..ea039e3e1a4d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1549,6 +1549,8 @@ static int ext4_journalled_write_end(struct file *file,
 			SetPageUptodate(page);
 	}
 	size_changed = ext4_update_inode_size(inode, pos + copied);
+	ext4_fc_update_commit_range(inode, pos, pos + copied);
+
 	ext4_set_inode_state(inode, EXT4_STATE_JDATA);
 	EXT4_I(inode)->i_datasync_tid = handle->h_transaction->t_tid;
 	unlock_page(page);
@@ -2610,8 +2612,12 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		i_size = i_size_read(inode);
 		if (disksize > i_size)
 			disksize = i_size;
-		if (disksize > EXT4_I(inode)->i_disksize)
+		if (disksize > EXT4_I(inode)->i_disksize) {
+			ext4_fc_update_commit_range(inode,
+						    EXT4_I(inode)->i_disksize,
+						    disksize);
 			EXT4_I(inode)->i_disksize = disksize;
+		}
 		up_write(&EXT4_I(inode)->i_data_sem);
 		err2 = ext4_mark_inode_dirty(handle, inode);
 		ext4_fc_enqueue_inode(handle, inode);
@@ -3220,6 +3226,8 @@ static int ext4_da_write_end(struct file *file,
 		}
 	}
 
+	ext4_fc_update_commit_range(inode, pos, pos + copied);
+
 	if (write_mode != CONVERT_INLINE_DATA &&
 	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
 	    ext4_has_inline_data(inode))
@@ -3627,6 +3635,7 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 		goto orphan_del;
 	}
 
+	ext4_fc_update_commit_range(inode, offset, offset + written);
 	if (ext4_update_inode_size(inode, offset + written)) {
 		ext4_mark_inode_dirty(handle, inode);
 		ext4_fc_enqueue_inode(handle, inode);
@@ -3751,6 +3760,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 		ext4_update_i_disksize(inode, inode->i_size);
 		ext4_journal_stop(handle);
 	}
+	ext4_fc_update_commit_range(inode, offset, offset + count);
 
 	BUG_ON(iocb->private == NULL);
 
@@ -3869,6 +3879,8 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 				ext4_mark_inode_dirty(handle, inode);
 				ext4_fc_enqueue_inode(handle, inode);
 			}
+			ext4_fc_update_commit_range(inode, offset,
+						    offset + end);
 		}
 		err = ext4_journal_stop(handle);
 		if (ret == 0)
@@ -5327,6 +5339,9 @@ static int ext4_do_update_inode(handle_t *handle,
 			cpu_to_le16(ei->i_file_acl >> 32);
 	raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl);
 	if (ei->i_disksize != ext4_isize(inode->i_sb, raw_inode)) {
+		ext4_fc_update_commit_range(inode,
+					    ext4_isize(inode->i_sb, raw_inode),
+					    ei->i_disksize);
 		ext4_isize_set(raw_inode, ei->i_disksize);
 		need_datasync = 1;
 	}
-- 
2.23.0.444.g18eeb5a265-goog


* [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (7 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 08/13] ext4: fast-commit commit range tracking Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-16 22:45   ` Theodore Y. Ts'o
  2019-10-01  7:40 ` [PATCH v3 10/13] ext4: fast-commit recovery " Harshad Shirwadkar
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch implements the actual commit path for fast commit. Based on
the inodes tracked and the changes remembered for each of them, this
patch adds code to create a fast commit block that stores the extents
added as well as the dentries created for the inode. We use the new JBD2
interfaces added in previous patches in this series. The fast commit
blocks that are created contain the extents that _should_ be present in
the file. Removal of extents is not yet supported, which makes
operations such as truncate and delete incompatible with fast commit.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
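A minimal user-space sketch of the tag-length-value layout that a fast
commit block uses for its records, for illustration only. It assumes,
as fc_add_tag() in this patch suggests, a 16-bit little-endian tag
followed by a 16-bit little-endian length and then the value bytes. The
names put_le16 and tlv_append are hypothetical, and the explicit bounds
check on every record is an addition of this sketch rather than
something the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TL_SIZE 4	/* 2 bytes of tag plus 2 bytes of length */

/* Store a 16-bit value in little-endian order on any host. */
static void put_le16(uint8_t *dst, uint16_t v)
{
	dst[0] = (uint8_t)(v & 0xff);
	dst[1] = (uint8_t)(v >> 8);
}

/*
 * Append one record at dst: tag, length, then len bytes of value.
 * Returns the position just past the record, or NULL if the record
 * would not fit before end.
 */
static uint8_t *tlv_append(uint8_t *dst, const uint8_t *end, uint16_t tag,
			   const void *val, uint16_t len)
{
	assert(dst <= end);

	if ((size_t)(end - dst) < (size_t)TL_SIZE + len)
		return NULL;

	put_le16(dst, tag);
	put_le16(dst + 2, len);
	memcpy(dst + TL_SIZE, val, len);

	return dst + TL_SIZE + len;
}
```

A reader of such a block can walk the records by reading the 4-byte
header, skipping the stated number of value bytes, and repeating until
the record count from the block header is exhausted.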
 fs/ext4/ext4_jbd2.c         | 309 ++++++++++++++++++++++++++++++++++++
 fs/ext4/ext4_jbd2.h         |  50 +++++-
 fs/ext4/extents.c           |   8 +-
 fs/ext4/inode.c             |  22 ++-
 fs/ext4/super.c             |  11 ++
 include/trace/events/ext4.h |  39 +++++
 6 files changed, 429 insertions(+), 10 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 0bb8de2139a5..fd7740372438 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -4,6 +4,7 @@
  */
 
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
 
 #include <trace/events/ext4.h>
 
@@ -480,3 +481,311 @@ bool ext4_is_inode_fc_new(struct inode *inode)
 
 	return ret;
 }
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	struct buffer_head *orig_bh = bh->b_private;
+
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+	if (orig_bh) {
+		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
+		/* Protect BH_Shadow bit in b_state */
+		smp_mb__after_atomic();
+		wake_up_bit(&orig_bh->b_state, BH_Shadow);
+	}
+	unlock_buffer(bh);
+}
+
+static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)
+{
+	struct ext4_fc_tl tl;
+
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(len);
+	memcpy(dst, &tl, sizeof(tl));
+	memcpy(dst + sizeof(tl), val, len);
+
+	return dst + sizeof(tl) + len;
+}
+
+int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
+			struct inode *inode, tid_t tid, tid_t subtid,
+			int is_last, struct dentry *dentry)
+{
+	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
+	struct super_block *sb = journal->j_private;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_commit_hdr *fc_hdr;
+	struct ext4_map_blocks map;
+	struct ext4_iloc iloc;
+	struct ext4_extent extent;
+	struct inode *parent;
+	__u32 dummy_csum = 0, csum;
+	__u8 *start, *cur, *end;
+	__u16 num_tlvs = 0;
+	int ret;
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (tid != ei->i_fc.fc_tid) {
+		jbd_debug(3,
+			  "File not modified. Modified %d, expected %d",
+			  ei->i_fc.fc_tid, tid);
+		read_unlock(&ei->i_fc.fc_lock);
+		return 0;
+	}
+	read_unlock(&ei->i_fc.fc_lock);
+
+	if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+		return -ECANCELED;
+
+	if (ext4_is_inode_fc_new(inode)) {
+		parent = d_inode(dentry->d_parent);
+		if (parent && ext4_is_inode_fc_ineligible(parent))
+			return -ECANCELED;
+	}
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	end = (__u8 *)bh->b_data + journal->j_blocksize;
+
+	write_lock(&ei->i_fc.fc_lock);
+	old_blk_size = (ei->i_fc.fc_lblk_start + sb->s_blocksize - 1) >>
+		       inode->i_blkbits;
+	new_blk_size = ei->i_fc.fc_lblk_end >> inode->i_blkbits;
+	ei->i_fc.fc_lblk_start = ei->i_fc.fc_lblk_end;
+	write_unlock(&ei->i_fc.fc_lock);
+
+	jbd_debug(3, "Committing as tid = %d, subtid = %d on buffer %lld\n",
+		  tid, subtid, bh->b_blocknr);
+
+	fc_hdr = (struct ext4_fc_commit_hdr *)
+			((__u8 *)bh->b_data + sizeof(journal_header_t));
+	fc_hdr->fc_magic = cpu_to_le32(EXT4_FC_MAGIC);
+	fc_hdr->fc_subtid = cpu_to_le32(subtid);
+	fc_hdr->fc_ino = cpu_to_le32(inode->i_ino);
+	fc_hdr->fc_features = 0;
+	fc_hdr->fc_flags = 0;
+
+	if (is_last)
+		ext4_fc_mark_last(fc_hdr);
+
+	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));
+	cur = (__u8 *)(fc_hdr + 1);
+	start = cur;
+	if (ext4_is_inode_fc_new(inode)) {
+		__le32 parent_ino;
+
+		read_lock(&ei->i_fc.fc_lock);
+		parent_ino = cpu_to_le32(ei->i_fc.fc_parent_ino);
+		read_unlock(&ei->i_fc.fc_lock);
+
+		if (!dentry)
+			return -ECANCELED;
+
+		cur = fc_add_tag(cur, EXT4_FC_TAG_PARENT_INO,
+				      sizeof(parent_ino), (u8 *)&parent_ino);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_DNAME,
+				 dentry->d_name.len,
+				 (u8 *)dentry->d_name.name);
+		num_tlvs = 2;
+	}
+	csum = 0;
+	cur_lblk_off = old_blk_size;
+	while (cur_lblk_off <= new_blk_size) {
+		map.m_lblk = cur_lblk_off;
+		map.m_len = new_blk_size - cur_lblk_off + 1;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (!ret) {
+			cur_lblk_off += map.m_len;
+			continue;
+		}
+
+		if (map.m_flags & EXT4_MAP_UNWRITTEN)
+			return -ECANCELED;
+		extent.ee_block = cpu_to_le32(map.m_lblk);
+		cur_lblk_off += map.m_len;
+		if (cur + sizeof(struct ext4_extent) +
+		    sizeof(struct ext4_fc_tl) >= end)
+			return -ENOSPC;
+
+		extent.ee_len = cpu_to_le16(map.m_len);
+		ext4_ext_store_pblock(&extent, map.m_pblk);
+		ext4_ext_mark_initialized(&extent);
+		cur = fc_add_tag(cur, EXT4_FC_TAG_EXT,
+				 sizeof(struct ext4_extent),
+				 (u8 *)&extent);
+		num_tlvs++;
+	}
+
+	fc_hdr->fc_num_tlvs = cpu_to_le16(num_tlvs);
+	csum = ext4_chksum(sbi, csum, (__u8 *)fc_hdr,
+			   offsetof(struct ext4_fc_commit_hdr, fc_csum));
+	csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+	csum = ext4_chksum(sbi, csum, start, cur - start);
+	fc_hdr->fc_csum = cpu_to_le32(csum);
+
+	jbd_debug(3, "Created FC block for inode %ld with [%d, %d]",
+		  inode->i_ino, tid, subtid);
+
+	return 1;
+}
+
+static void ext4_journal_fc_cleanup_cb(journal_t *journal)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct inode *inode;
+
+	spin_lock(&sbi->s_fc_lock);
+	while (!list_empty(&sbi->s_fc_q)) {
+		iter = list_first_entry(&sbi->s_fc_q,
+				  struct ext4_inode_info, i_fc_list);
+		list_del_init(&iter->i_fc_list);
+		inode = &iter->vfs_inode;
+	}
+	INIT_LIST_HEAD(&sbi->s_fc_q);
+	sbi->s_fc_q_cnt = 0;
+	spin_unlock(&sbi->s_fc_lock);
+	sbi->s_fc_eligible = true;
+}
+
+/*
+ * Fast-commit commit callback. Both sbi->s_fc_lock and i_data_sem are taken
+ * on this path. The locking order is i_data_sem first, then s_fc_lock.
+ */
+static int ext4_journal_fc_commit_cb(journal_t *journal, tid_t tid,
+			      tid_t subtid,
+			      struct transaction_run_stats_s *stats)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct list_head *pos, *tmp;
+	struct ext4_inode_info *iter;
+	int num_bufs = 0, ret;
+
+	memset(stats, 0, sizeof(*stats));
+
+	trace_ext4_journal_fc_commit_cb_start(sb);
+	sbi = sbi;
+	spin_lock(&sbi->s_fc_lock);
+	if (!sbi->s_fc_eligible) {
+		sbi->s_fc_eligible = true;
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown");
+		return -EIO;
+	}
+
+	stats->rs_flushing = jiffies;
+	/* Submit data buffers first */
+	list_for_each(pos, &sbi->s_fc_q) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		ret = jbd2_submit_inode_data(journal, iter->jinode);
+		if (ret) {
+			spin_unlock(&sbi->s_fc_lock);
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "data_commit");
+			return ret;
+		}
+	}
+	stats->rs_logging = jiffies;
+	stats->rs_flushing = jbd2_time_diff(stats->rs_flushing,
+					    stats->rs_logging);
+
+	list_for_each_safe(pos, tmp, &sbi->s_fc_q) {
+		struct inode *inode;
+		struct buffer_head *bh;
+		int is_last;
+
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		inode = &iter->vfs_inode;
+
+		is_last = list_is_last(pos, &sbi->s_fc_q);
+		spin_unlock(&sbi->s_fc_lock);
+
+		ret = jbd2_map_fc_buf(journal, &bh);
+		if (ret) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "map_fc_buf");
+			return -ENOMEM;
+		}
+
+		/*
+		 * Release s_fc_lock here since fc_write_inode calls
+		 * ext4_map_blocks which needs i_data_sem.
+		 */
+		ret = ext4_fc_write_inode(journal, bh, inode, tid, subtid,
+					  is_last, NULL);
+		if (ret < 0) {
+			trace_ext4_journal_fc_commit_cb_stop(sb, 0,
+							     "fc_write_inode");
+			return ret;
+		}
+		lock_buffer(bh);
+		clear_buffer_dirty(bh);
+		set_buffer_uptodate(bh);
+		bh->b_end_io = ext4_end_buffer_io_sync;
+		submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+		spin_lock(&sbi->s_fc_lock);
+
+		num_bufs++;
+	}
+
+	stats->rs_logging = jbd2_time_diff(stats->rs_logging, jiffies);
+	if (num_bufs == 0) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "no_data");
+		stats->rs_blocks_logged = num_bufs;
+		return 0;
+	}
+
+	/*
+	 * Before returning, check if s_fc_eligible was modified since we
+	 * started.
+	 */
+	if (!sbi->s_fc_eligible) {
+		spin_unlock(&sbi->s_fc_lock);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "ineligible2");
+		return -ECANCELED;
+	}
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(sb)))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "shutdown2");
+		return -EIO;
+	}
+
+	spin_unlock(&sbi->s_fc_lock);
+
+	jbd_debug(3, "%s: Journal blocks ready for fast commit\n", __func__);
+
+	stats->rs_blocks_logged = num_bufs;
+
+	trace_ext4_journal_fc_commit_cb_stop(sb, num_bufs, "success");
+
+	return jbd2_wait_on_fc_bufs(journal, num_bufs);
+}
+
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
+{
+	if (ext4_should_fast_commit(sb)) {
+		journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
+		journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
+	}
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 2cb7e7e1f025..acb9533068c4 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -397,8 +397,14 @@ static inline void ext4_update_inode_fsync_trans(handle_t *handle,
 
 	if (ext4_handle_valid(handle) && !is_handle_aborted(handle)) {
 		ei->i_sync_tid = handle->h_transaction->t_tid;
-		if (datasync)
+		if (ext4_should_fast_commit(inode->i_sb))
+			ei->i_sync_subtid = handle->h_transaction->t_subtid;
+		if (datasync) {
 			ei->i_datasync_tid = handle->h_transaction->t_tid;
+			if (ext4_should_fast_commit(inode->i_sb))
+				ei->i_datasync_subtid =
+						handle->h_transaction->t_subtid;
+		}
 	}
 }
 
@@ -470,6 +476,47 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)
 	return 1;
 }
 
+/* Ext4 fast commit related info */
+
+/* Magic of fast commit header */
+#define EXT4_FC_MAGIC			0xE2540090
+
+#define EXT4_FC_FL_LAST			0x00000001
+
+#define ext4_fc_is_last(__fc_hdr)	(((__fc_hdr)->fc_flags) &	\
+					 EXT4_FC_FL_LAST)
+
+#define ext4_fc_mark_last(__fc_hdr)	(((__fc_hdr)->fc_flags) |=	\
+					 EXT4_FC_FL_LAST)
+
+struct ext4_fc_commit_hdr {
+	/* Fast commit magic, should be EXT4_FC_MAGIC */
+	__le32 fc_magic;
+	/* Sub transaction ID */
+	__le32 fc_subtid;
+	/* Features used by this fast commit block */
+	__u8 fc_features;
+	/* Flags for this block. */
+	__u8 fc_flags;
+	/* Number of TLVs in this fast commit block */
+	__le16 fc_num_tlvs;
+	/* Inode number */
+	__le32 fc_ino;
+	/* ext4 inode on disk copy */
+	struct ext4_inode inode;
+	/* Csum(hdr+contents) */
+	__le32 fc_csum;
+};
+
+#define EXT4_FC_TAG_EXT		0x1	/* Extent */
+#define EXT4_FC_TAG_DNAME	0x2
+#define EXT4_FC_TAG_PARENT_INO	0x3
+
+struct ext4_fc_tl {
+	__le16 fc_tag;
+	__le16 fc_len;
+};
+
 void ext4_init_inode_fc_info(struct inode *inode);
 extern void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode);
 extern void ext4_fc_del(struct inode *inode);
@@ -507,4 +554,5 @@ void ext4_fc_update_commit_range(struct inode *inode, ext4_lblk_t start,
 void ext4_fc_mark_new(struct inode *inode);
 bool ext4_is_inode_fc_ineligible(struct inode *inode);
 bool ext4_is_inode_fc_new(struct inode *inode);
+void ext4_init_fast_commit(struct super_block *sb, journal_t *journal);
 #endif	/* _EXT4_JBD2_H */
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b30f6175eb71..dea4c2632272 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4898,10 +4898,10 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (ret)
 		goto out;
 
-	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
-		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
-	}
+	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal)
+		ret = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 out:
 	inode_unlock(inode);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea039e3e1a4d..cbfa1ec858a1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5039,20 +5039,25 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	 */
 	if (journal) {
 		transaction_t *transaction;
-		tid_t tid;
+		tid_t tid, subtid;
 
 		read_lock(&journal->j_state_lock);
 		if (journal->j_running_transaction)
 			transaction = journal->j_running_transaction;
 		else
 			transaction = journal->j_committing_transaction;
-		if (transaction)
+		if (transaction) {
 			tid = transaction->t_tid;
-		else
+			subtid = transaction->t_subtid;
+		} else {
 			tid = journal->j_commit_sequence;
+			subtid = journal->j_fc_sequence;
+		}
 		read_unlock(&journal->j_state_lock);
 		ei->i_sync_tid = tid;
 		ei->i_datasync_tid = tid;
+		ei->i_sync_subtid = subtid;
+		ei->i_datasync_subtid = subtid;
 	}
 
 	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
@@ -5475,8 +5480,9 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
-		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
+		err = jbd2_fc_complete_commit(
+		    EXT4_SB(inode->i_sb)->s_journal, EXT4_I(inode)->i_sync_tid,
+		    EXT4_I(inode)->i_sync_subtid);
 	} else {
 		struct ext4_iloc iloc;
 
@@ -5628,6 +5634,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (attr->ia_valid & ATTR_GID)
 			inode->i_gid = attr->ia_gid;
 		error = ext4_mark_inode_dirty(handle, inode);
+		ext4_fc_enqueue_inode(handle, inode);
 		ext4_journal_stop(handle);
 	}
 
@@ -5688,6 +5695,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				inode->i_mtime = current_time(inode);
 				inode->i_ctime = inode->i_mtime;
 			}
+			ext4_fc_enqueue_inode(handle, inode);
 			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
@@ -5732,6 +5740,8 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 	if (!error) {
 		setattr_copy(inode, attr);
+		ext4_fc_enqueue_inode(ext4_journal_current_handle(),
+						   inode);
 		mark_inode_dirty(inode);
 	}
 
@@ -6144,6 +6154,7 @@ void ext4_dirty_inode(struct inode *inode, int flags)
 		goto out;
 
 	ext4_mark_inode_dirty(handle, inode);
+	ext4_fc_enqueue_inode(handle, inode);
 
 	ext4_journal_stop(handle);
 out:
@@ -6229,6 +6240,7 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	ext4_fc_mark_ineligible(inode);
 	err = ext4_mark_inode_dirty(handle, inode);
 	ext4_handle_sync(handle);
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3e9570ea9748..208c57b5ac80 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1129,6 +1129,16 @@ static void ext4_destroy_inode(struct inode *inode)
 				true);
 		dump_stack();
 	}
+	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+#ifdef EXT4FS_DEBUG
+		if (EXT4_SB(inode->i_sb)->s_fc_eligible) {
+			pr_warn("%s: INODE %ld in FC List with FC allowed",
+				__func__, inode->i_ino);
+			dump_stack();
+		}
+#endif
+		ext4_fc_del(inode);
+	}
 }
 
 static void init_once(void *foo)
@@ -4713,6 +4723,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 	journal->j_commit_interval = sbi->s_commit_interval;
 	journal->j_min_batch_time = sbi->s_min_batch_time;
 	journal->j_max_batch_time = sbi->s_max_batch_time;
+	ext4_init_fast_commit(sb, journal);
 
 	write_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d68e9e536814..9c24b1c5239f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,45 @@ TRACE_EVENT(ext4_error,
 		  __entry->function, __entry->line)
 );
 
+TRACE_EVENT(ext4_journal_fc_commit_cb_start,
+	TP_PROTO(struct super_block *sb),
+
+	TP_ARGS(sb),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+	),
+
+	TP_printk("fast_commit started on dev %d,%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_journal_fc_commit_cb_stop,
+	    TP_PROTO(struct super_block *sb, int nblks, const char *reason),
+
+	TP_ARGS(sb, nblks, reason),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, nblks)
+		__field(const char *, reason)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->nblks = nblks;
+		__entry->reason = reason;
+	),
+
+	TP_printk("fast_commit done on dev %d,%d, nblks %d, reason %s",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nblks, __entry->reason)
+);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 10/13] ext4: fast-commit recovery path changes
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (8 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
@ 2019-10-01  7:40 ` Harshad Shirwadkar
  2019-10-18  2:07   ` Theodore Y. Ts'o
  2019-10-01  7:41 ` [PATCH v3 11/13] ext4: add support for asynchronous fast commits Harshad Shirwadkar
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:40 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch adds the core fast-commit recovery path. Each fast commit
block stores the modified extents and any added dentry for a particular
file. The replay code maps the blocks in each such extent to the actual
file one by one and updates the corresponding file system metadata to
account for the newly mapped blocks. For newly added dentries, we open
the parent inode and add the dentry found in the fast commit block to
the parent directory. To achieve this, ext4_inode_csum_set(),
ext4_inode_blocks(), ext4_find_entry(), ext4_add_nondir() and
ext4_reset_inode_seed(), which were previously static, are now made
visible.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
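
Note for readers: the replay code below walks the tag-length-value (TLV)
records that follow struct ext4_fc_commit_hdr in each fast commit block.
The following is a minimal user-space sketch of that walk, not the kernel
code itself. The tag values and the 4-byte tag/length layout are taken
from this series; the names demo_add_tag() and demo_walk_tlvs(), the
FC_* macros and the explicit bounds checks are assumptions added purely
for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Tag values as defined by this series in fs/ext4/ext4_jbd2.h. */
#define FC_TAG_EXT        0x1
#define FC_TAG_DNAME      0x2
#define FC_TAG_PARENT_INO 0x3

/* On-disk size of struct ext4_fc_tl: two __le16 fields (tag, length). */
#define FC_TL_SIZE 4

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Append one TLV at dst and return the position just past it, in the
 * spirit of fc_add_tag(). The capacity check is an illustrative addition.
 */
static uint8_t *demo_add_tag(uint8_t *dst, uint8_t *end, uint16_t tag,
			     uint16_t len, const void *val)
{
	assert((size_t)(end - dst) >= (size_t)FC_TL_SIZE + len);
	put_le16(dst, tag);
	put_le16(dst + 2, len);
	memcpy(dst + FC_TL_SIZE, val, len);
	return dst + FC_TL_SIZE + len;
}

/*
 * Walk num_tlvs records in buf. Returns the number of bytes consumed, or
 * -1 on an unknown tag or if a record would run past size. The number of
 * extent records seen is stored in *n_ext.
 */
static long demo_walk_tlvs(const uint8_t *buf, size_t size,
			   unsigned int num_tlvs, unsigned int *n_ext)
{
	size_t off = 0;
	unsigned int i;

	*n_ext = 0;
	for (i = 0; i < num_tlvs; i++) {
		uint16_t tag, len;

		if (size - off < FC_TL_SIZE)
			return -1;
		tag = get_le16(buf + off);
		len = get_le16(buf + off + 2);
		if (size - off - FC_TL_SIZE < len)
			return -1;

		switch (tag) {
		case FC_TAG_EXT:
			(*n_ext)++;
			break;
		case FC_TAG_DNAME:
		case FC_TAG_PARENT_INO:
			break;
		default:
			return -1;
		}
		off += FC_TL_SIZE + len;
	}
	return (long)off;
}
```

The byte count returned by the walk corresponds to the span that the scan
pass feeds into the checksum after the header. The sketch rejects records
that overrun the buffer, which the replay code in this patch does not
check explicitly.
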
 fs/ext4/balloc.c            |   7 +-
 fs/ext4/ext4.h              |  19 ++
 fs/ext4/ext4_jbd2.c         | 369 ++++++++++++++++++++++++++++++++++++
 fs/ext4/extents.c           |  19 +-
 fs/ext4/ialloc.c            |  51 +++--
 fs/ext4/inode.c             |  13 +-
 fs/ext4/ioctl.c             |   6 +-
 fs/ext4/mballoc.c           |  83 ++++++++
 fs/ext4/mballoc.h           |   2 +
 fs/ext4/namei.c             |   4 +-
 include/trace/events/ext4.h |  22 +++
 11 files changed, 560 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 0b202e00d93f..2433f12d2d88 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -360,7 +360,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_fc_replay)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c36ec23046f3..cd5b567d8ca8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1404,6 +1404,13 @@ struct ext4_super_block {
 #define ext4_has_strict_mode(sbi) \
 	(sbi->s_encoding_flags & EXT4_ENC_STRICT_MODE_FL)
 
+struct ext4_fc_replay_state {
+	int fc_replay_error;
+	int fc_replay_expected_off;
+	int fc_replay_expected_tid;
+	int fc_replay_current_subtid;
+};
+
 /*
  * fourth extended-fs super-block data in memory
  */
@@ -1588,6 +1595,7 @@ struct ext4_sb_info {
 					 * Are changes after the last commit
 					 * eligible for fast commit?
 					 */
+	struct ext4_fc_replay_state s_fc_replay_state;
 	spinlock_t s_fc_lock;
 };
 
@@ -2577,6 +2585,10 @@ extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
 extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
 
 /* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei);
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+			   struct ext4_inode_info *ei);
 int ext4_inode_is_fast_symlink(struct inode *inode);
 struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2660,12 +2672,19 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
+extern void ext4_reset_inode_seed(struct inode *inode);
 
 /* migrate.c */
 extern int ext4_ext_migrate(struct inode *);
 extern int ext4_ind_migrate(struct inode *inode);
 
 /* namei.c */
+extern struct buffer_head *ext4_find_entry(struct inode *dir,
+					   const struct qstr *d_name,
+					   struct ext4_dir_entry_2 **res_dir,
+				    int *inlined);
+extern int ext4_add_nondir(handle_t *handle,
+		    struct dentry *dentry, struct inode *inode);
 extern int ext4_dirblock_csum_verify(struct inode *inode,
 				     struct buffer_head *bh);
 extern int ext4_orphan_add(handle_t *, struct inode *);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index fd7740372438..12d6e70bf676 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -5,6 +5,7 @@
 
 #include "ext4_jbd2.h"
 #include "ext4_extents.h"
+#include "mballoc.h"
 
 #include <trace/events/ext4.h>
 
@@ -517,6 +518,16 @@ static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)
 	return dst + sizeof(tl) + len;
 }
 
+static int fc_tag_len(struct ext4_fc_tl *tl)
+{
+	return le16_to_cpu(tl->fc_len);
+}
+
+static u8 *fc_tag_val(struct ext4_fc_tl *tl)
+{
+	return (u8 *)tl + sizeof(*tl);
+}
+
 int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
 			struct inode *inode, tid_t tid, tid_t subtid,
 			int is_last, struct dentry *dentry)
@@ -782,10 +793,368 @@ static int ext4_journal_fc_commit_cb(journal_t *journal, tid_t tid,
 	return jbd2_wait_on_fc_bufs(journal, num_bufs);
 }
 
+int ext4_fc_create_inode(struct super_block *sb, struct ext4_inode *raw_inode,
+			 int ino, unsigned long parent, const char *dname,
+			 int dlen)
+{
+	struct inode *dir = NULL, *inode = NULL;
+	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
+	struct qstr qstr_dname = QSTR_INIT(dname, dlen);
+	struct ext4_dir_entry_2 *res_dir = NULL;
+	struct buffer_head *dirent_bh;
+	int ret = 0, inlined;
+
+	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
+	if (!IS_ERR(inode)) {
+		jbd_debug(1, "Inode %d already exists.", inode->i_ino);
+		iput(inode);
+		return PTR_ERR(inode);
+	}
+
+	dir = ext4_iget(sb, parent, EXT4_IGET_NORMAL);
+	if (IS_ERR(dir)) {
+		jbd_debug(1, "Dir with inode %d not found.", parent);
+		ret = PTR_ERR(dir);
+		goto out;
+	}
+
+	dentry_dir = d_obtain_alias(dir);
+	if (IS_ERR(dentry_dir)) {
+		jbd_debug(1, "Failed to obtain dentry");
+		ret = PTR_ERR(dentry_dir);
+		goto out;
+	}
+
+	dentry_inode = d_alloc(dentry_dir, &qstr_dname);
+	if (!dentry_inode) {
+		jbd_debug(1, "Inode dentry not created.");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	inode = ext4_new_inode(NULL, dir, le16_to_cpu(raw_inode->i_mode), NULL,
+			       ino, NULL, le32_to_cpu(raw_inode->i_flags));
+	if (IS_ERR(inode)) {
+		jbd_debug(1, "Failed to create a new inode.");
+		ret = PTR_ERR(inode);
+		goto out;
+	}
+
+	dirent_bh = ext4_find_entry(dir, &qstr_dname, &res_dir, &inlined);
+	if (!dirent_bh || IS_ERR(dirent_bh)) {
+		ret = ext4_add_nondir(NULL, dentry_inode, inode);
+		if (ret != 0) {
+			jbd_debug(1, "Failed to add dentry\n");
+			goto out;
+		}
+	} else {
+		if (le32_to_cpu(res_dir->inode) != inode->i_ino) {
+			jbd_debug(1, "Entry exists and mismatched inode nos.");
+			brelse(dirent_bh);
+			ret = -EEXIST;
+			goto out;
+		}
+		brelse(dirent_bh);
+	}
+
+	ext4_mark_inode_dirty(NULL, dir);
+
+out:
+	if (dentry_dir) {
+		d_drop(dentry_dir);
+		dput(dentry_dir);
+	} else if (dir) {
+		iput(dir);
+	}
+	if (dentry_inode) {
+		d_drop(dentry_inode);
+		dput(dentry_inode);
+	}
+
+	return ret;
+}
+
+static int ext4_journal_fc_replay_scan(struct super_block *sb,
+				       struct buffer_head *bh, int off)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_replay_state *state;
+	struct ext4_fc_commit_hdr *fc_hdr;
+	struct ext4_fc_tl *tl;
+	__u32 csum, dummy_csum = 0;
+	__u8 *start;
+	tid_t fc_subtid;
+	int i;
+
+	state = &sbi->s_fc_replay_state;
+	fc_hdr = (struct ext4_fc_commit_hdr *)
+		  ((__u8 *)bh->b_data + sizeof(journal_header_t));
+
+	fc_subtid = le32_to_cpu(fc_hdr->fc_subtid);
+
+	if (le32_to_cpu(fc_hdr->fc_magic) != EXT4_FC_MAGIC) {
+		state->fc_replay_error = -ENOENT;
+		goto out_err;
+	}
+
+	if (off != state->fc_replay_expected_off) {
+		state->fc_replay_error = -EFSCORRUPTED;
+		goto out_err;
+	}
+
+	if (fc_hdr->fc_features) {
+		state->fc_replay_error = -EOPNOTSUPP;
+		goto out_err;
+	}
+
+	/* Check if we already concluded that this fast commit is not useful */
+	if (state->fc_replay_error && state->fc_replay_error != -EPROTO)
+		goto out_err;
+
+	if (state->fc_replay_expected_off == 0) {
+		/* This is the first block */
+		state->fc_replay_current_subtid = fc_subtid;
+		/*
+		 * We set replay error by default until we find an end
+		 * block for a particular subtid
+		 */
+		state->fc_replay_error = -EPROTO;
+	}
+
+	if (state->fc_replay_error == 0) {
+		/*
+		 * We have already encountered _last_ block for previous
+		 * subtid. So we should only find a bigger subtid here.
+		 */
+		if (fc_subtid <= state->fc_replay_current_subtid) {
+			state->fc_replay_error = -EFSCORRUPTED;
+			goto out_err;
+		}
+		state->fc_replay_current_subtid = fc_subtid;
+		state->fc_replay_error = -EPROTO;
+	} else if (state->fc_replay_current_subtid != fc_subtid) {
+		/*
+		 * Different subtid found before we found the end of this
+		 * subtid.
+		 */
+		state->fc_replay_error = -EFSCORRUPTED;
+		goto out_err;
+	}
+
+	/*
+	 * We can replay fast commit blocks only if we find a _last_ block for
+	 * all subtids.
+	 */
+	if (ext4_fc_is_last(fc_hdr))
+		state->fc_replay_error = 0;
+
+	csum = ext4_chksum(sbi, 0, fc_hdr,
+			   offsetof(struct ext4_fc_commit_hdr, fc_csum));
+	csum = ext4_chksum(sbi, csum, &dummy_csum, sizeof(dummy_csum));
+
+	tl = (struct ext4_fc_tl *)(fc_hdr + 1);
+	start = (__u8 *)tl;
+	for (i = 0; i < le16_to_cpu(fc_hdr->fc_num_tlvs); i++) {
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_PARENT_INO:
+		case EXT4_FC_TAG_DNAME:
+		case EXT4_FC_TAG_EXT:
+			break;
+		default:
+			goto out_err;
+		}
+		tl = (struct ext4_fc_tl *)((__u8 *)tl +
+					   le16_to_cpu(tl->fc_len) +
+					   sizeof(*tl));
+	}
+	csum = ext4_chksum(sbi, csum, start, (__u8 *)tl - start);
+	if (csum != le32_to_cpu(fc_hdr->fc_csum)) {
+		state->fc_replay_error = -EFSBADCRC;
+		goto out_err;
+	}
+
+	state->fc_replay_expected_off++;
+	return 0;
+
+out_err:
+	trace_ext4_journal_fc_replay_scan(sb, off, state->fc_replay_error);
+	return state->fc_replay_error;
+}
+
+static void ext4_fc_add_block(struct inode *inode, ext4_lblk_t lblk,
+			      ext4_fsblk_t pblk, int unwritten)
+{
+	struct ext4_extent ex;
+	struct ext4_ext_path *path = NULL;
+	struct ext4_map_blocks map;
+	int ret;
+
+	map.m_lblk = lblk;
+	map.m_len = 0x1;
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret > 0) {
+		if (pblk != map.m_pblk)
+			jbd_debug(1, "Bad mapping found while replaying fc\n");
+		return;
+	}
+
+	ex.ee_block = cpu_to_le32(lblk);
+	ext4_ext_store_pblock(&ex, pblk);
+	ex.ee_len = cpu_to_le16(0x1);
+	if (unwritten)
+		ext4_ext_mark_unwritten(&ex);
+
+	path = ext4_find_extent(inode, lblk, NULL, 0);
+	if (path) {
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ret = ext4_ext_insert_extent(NULL, inode, &path, &ex, 0);
+		ext4_mb_mark_used(inode->i_sb, ext4_ext_pblock(&ex), 0x1);
+		up_write((&EXT4_I(inode)->i_data_sem));
+		kfree(path);
+	}
+}
+
+static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
+				     enum passtype pass, int off)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_commit_hdr *fc_hdr;
+	struct ext4_fc_tl *tl;
+	struct ext4_iloc iloc;
+	struct ext4_extent *ex;
+	struct inode *inode;
+	char *dname = NULL;
+	int dname_len = 0;
+	int parent_ino = -1;
+	int i, j, ret;
+
+	if (pass == PASS_SCAN)
+		return ext4_journal_fc_replay_scan(sb, bh, off);
+
+	if (sbi->s_fc_replay_state.fc_replay_error) {
+		jbd_debug(1, "FC replay error set = %d\n",
+			  sbi->s_fc_replay_state.fc_replay_error);
+		return sbi->s_fc_replay_state.fc_replay_error;
+	}
+
+	sbi->s_fc_replay = true;
+	fc_hdr = (struct ext4_fc_commit_hdr *)
+		  ((__u8 *)bh->b_data + sizeof(journal_header_t));
+
+	jbd_debug(3, "%s: Got FC block for inode %d at [%d,%d]", __func__,
+		  le32_to_cpu(fc_hdr->fc_ino),
+		  be32_to_cpu(((journal_header_t *)bh->b_data)->h_sequence),
+		  le32_to_cpu(fc_hdr->fc_subtid));
+
+	tl = (struct ext4_fc_tl *)(fc_hdr + 1);
+	if (le16_to_cpu(fc_hdr->fc_num_tlvs) >= 2) {
+		for (i = 0; i < 2; i++) {
+			switch (le16_to_cpu(tl->fc_tag)) {
+			case EXT4_FC_TAG_DNAME:
+				dname = fc_tag_val(tl);
+				dname_len = fc_tag_len(tl);
+				break;
+			case EXT4_FC_TAG_PARENT_INO:
+				parent_ino = le32_to_cpu(
+				    *(__le32 *)fc_tag_val(tl));
+				break;
+			}
+			tl = (struct ext4_fc_tl *)(fc_tag_val(tl) +
+						   fc_tag_len(tl));
+		}
+	}
+
+	if (parent_ino && dname) {
+		ret = ext4_fc_create_inode(sb, &fc_hdr->inode,
+				     le32_to_cpu(fc_hdr->fc_ino), parent_ino,
+				     dname, dname_len);
+		if (ret) {
+			jbd_debug(1, "Failed to create ext4 inode.");
+			return ret;
+		}
+	}
+
+	inode = ext4_iget(sb, le32_to_cpu(fc_hdr->fc_ino), EXT4_IGET_NORMAL);
+	if (IS_ERR(inode))
+		return 0;
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	inode_lock(inode);
+	tl = (struct ext4_fc_tl *)(fc_hdr + 1);
+	for (i = 0; i < le16_to_cpu(fc_hdr->fc_num_tlvs); i++) {
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_EXT:
+			ex = (struct ext4_extent *)(tl + 1);
+			/*
+			 * We add block by block because part of extent may
+			 * already have been added by a previous fast commit
+			 * replay.
+			 */
+			for (j = 0; j < ext4_ext_get_actual_len(ex); j++)
+				ext4_fc_add_block(inode,
+						  le32_to_cpu(ex->ee_block) + j,
+						  ext4_ext_pblock(ex) + j,
+						  ext4_ext_is_unwritten(ex));
+			break;
+		case EXT4_FC_TAG_PARENT_INO:
+		case EXT4_FC_TAG_DNAME:
+			break;
+		default:
+			jbd_debug(1, "Unknown tag found.\n");
+		}
+		tl = (struct ext4_fc_tl *)((__u8 *)tl +
+					   le16_to_cpu(tl->fc_len) +
+					   sizeof(*tl));
+	}
+	ext4_reserve_inode_write(NULL, inode, &iloc);
+	inode_unlock(inode);
+
+	/*
+	 * Unless inode contains inline data, copy everything except
+	 * i_block. i_block would have been set correctly by the
+	 * ext4_fc_add_block() calls above.
+	 */
+	if (ext4_has_inline_data(inode)) {
+		memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
+		       sizeof(struct ext4_inode));
+	} else {
+		memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
+		       offsetof(struct ext4_inode, i_block));
+		memcpy(&ext4_raw_inode(&iloc)->i_generation,
+		       &fc_hdr->inode.i_generation,
+		       sizeof(struct ext4_inode) -
+		       offsetof(struct ext4_inode, i_generation));
+	}
+	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
+	ext4_reset_inode_seed(inode);
+
+	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
+	ret = ext4_handle_dirty_metadata(NULL, inode, iloc.bh);
+	brelse(iloc.bh);
+	iput(inode);
+	if (!ret)
+		ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
+
+	sbi->s_fc_replay = false;
+
+	return ret;
+}
+
 void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
 {
 	if (ext4_should_fast_commit(sb)) {
 		journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
 		journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
 	}
+
+	/*
+	 * We set the replay callback even if fast commit is disabled, because
+	 * the journal could still contain fast commit blocks that need to be
+	 * replayed even though fast commit has now been turned off.
+	 */
+	journal->j_fc_replay_callback = ext4_journal_fc_replay_cb;
 }
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index dea4c2632272..d70c09cbbc3f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2893,7 +2893,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	int depth = ext_depth(inode);
 	struct ext4_ext_path *path = NULL;
 	struct partial_cluster partial;
-	handle_t *handle;
+	handle_t *handle = NULL;
 	int i = 0, err = 0;
 
 	partial.pclu = 0;
@@ -2903,9 +2903,11 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	ext_debug("truncate since %u to %u\n", start, end);
 
 	/* probably first extent we're gonna free will be last in block */
-	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
-	if (IS_ERR(handle))
-		return PTR_ERR(handle);
+	if (!sbi->s_fc_replay) {
+		handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
+		if (IS_ERR(handle))
+			return PTR_ERR(handle);
+	}
 
 again:
 	trace_ext4_ext_remove_space(inode, start, end, depth);
@@ -2925,7 +2927,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 		/* find extent for or closest extent to this block */
 		path = ext4_find_extent(inode, end, NULL, EXT4_EX_NOCACHE);
 		if (IS_ERR(path)) {
-			ext4_journal_stop(handle);
+			if (!sbi->s_fc_replay)
+				ext4_journal_stop(handle);
 			return PTR_ERR(path);
 		}
 		depth = ext_depth(inode);
@@ -3011,7 +3014,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 		path = kcalloc(depth + 1, sizeof(struct ext4_ext_path),
 			       GFP_NOFS);
 		if (path == NULL) {
-			ext4_journal_stop(handle);
+			if (!sbi->s_fc_replay)
+				ext4_journal_stop(handle);
 			return -ENOMEM;
 		}
 		path[0].p_maxdepth = path[0].p_depth = depth;
@@ -3141,7 +3145,8 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 	path = NULL;
 	if (err == -EAGAIN)
 		goto again;
-	ext4_journal_stop(handle);
+	if (!sbi->s_fc_replay)
+		ext4_journal_stop(handle);
 
 	return err;
 }
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 47d04a33a3ca..d32dea0757fe 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,7 +82,12 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_fc_replay)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
@@ -287,15 +292,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
 	bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
 	/* Don't bother if the inode bitmap is corrupt. */
-	grp = ext4_get_group_info(sb, block_group);
 	if (IS_ERR(bitmap_bh)) {
 		fatal = PTR_ERR(bitmap_bh);
 		bitmap_bh = NULL;
 		goto error_return;
 	}
-	if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
-		fatal = -EFSCORRUPTED;
-		goto error_return;
+	if (!sbi->s_fc_replay) {
+		grp = ext4_get_group_info(sb, block_group);
+		if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
+			fatal = -EFSCORRUPTED;
+			goto error_return;
+		}
 	}
 
 	BUFFER_TRACE(bitmap_bh, "get_write_access");
@@ -758,7 +765,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	struct inode *ret;
 	ext4_group_t i;
 	ext4_group_t flex_group;
-	struct ext4_group_info *grp;
+	struct ext4_group_info *grp = NULL;
 	int encrypt = 0;
 
 	/* Cannot create files in a deleted directory */
@@ -896,15 +903,20 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ext4_free_inodes_count(sb, gdp) == 0)
 			goto next_group;
 
-		grp = ext4_get_group_info(sb, group);
-		/* Skip groups with already-known suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
-			goto next_group;
+		if (!sbi->s_fc_replay) {
+			grp = ext4_get_group_info(sb, group);
+			/*
+			 * Skip groups with already-known suspicious inode
+			 * tables
+			 */
+			if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
+				goto next_group;
+		}
 
 		brelse(inode_bitmap_bh);
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		/* Skip groups with suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) ||
+		if ((!sbi->s_fc_replay && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
 		    IS_ERR(inode_bitmap_bh)) {
 			inode_bitmap_bh = NULL;
 			goto next_group;
@@ -923,7 +935,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			goto next_group;
 		}
 
-		if (!handle) {
+		if (!sbi->s_fc_replay && !handle) {
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 							 handle_type, nblocks,
@@ -1027,9 +1039,15 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	/* Update the relevant bg descriptor fields */
 	if (ext4_has_group_desc_csum(sb)) {
 		int free;
-		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
-
-		down_read(&grp->alloc_sem); /* protect vs itable lazyinit */
+		struct ext4_group_info *grp = NULL;
+
+		if (!sbi->s_fc_replay) {
+			grp = ext4_get_group_info(sb, group);
+			down_read(&grp->alloc_sem); /*
+						     * protect vs itable
+						     * lazyinit
+						     */
+		}
 		ext4_lock_group(sb, group); /* while we modify the bg desc */
 		free = EXT4_INODES_PER_GROUP(sb) -
 			ext4_itable_unused_count(sb, gdp);
@@ -1045,7 +1063,8 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ino > free)
 			ext4_itable_unused_set(sb, gdp,
 					(EXT4_INODES_PER_GROUP(sb) - ino));
-		up_read(&grp->alloc_sem);
+		if (!sbi->s_fc_replay)
+			up_read(&grp->alloc_sem);
 	} else {
 		ext4_lock_group(sb, group);
 	}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cbfa1ec858a1..9e5d8a82556f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -103,8 +103,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
 	return provided == calculated;
 }
 
-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
-				struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei)
 {
 	__u32 csum;
 
@@ -4800,8 +4800,8 @@ void ext4_set_inode_flags(struct inode *inode)
 			S_ENCRYPTED|S_CASEFOLD);
 }
 
-static blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
-				  struct ext4_inode_info *ei)
+blkcnt_t ext4_inode_blocks(struct ext4_inode *raw_inode,
+			   struct ext4_inode_info *ei)
 {
 	blkcnt_t i_blocks ;
 	struct inode *inode = &(ei->vfs_inode);
@@ -4951,8 +4951,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	}
 
 	if (!ext4_inode_csum_verify(inode, raw_inode, ei)) {
-		ext4_error_inode(inode, function, line, 0,
-				 "iget: checksum invalid");
+		if (!EXT4_SB(sb)->s_fc_replay)
+			ext4_error_inode(inode, function, line, 0,
+					 "iget: checksum invalid");
 		ret = -EFSBADCRC;
 		goto bad_inode;
 	}
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index a8e23acb5c03..35019e9d2803 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 	i_size_write(inode2, isize);
 }
 
-static void reset_inode_seed(struct inode *inode)
+void ext4_reset_inode_seed(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -199,8 +199,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_generation = prandom_u32();
 	inode_bl->i_generation = prandom_u32();
-	reset_inode_seed(inode);
-	reset_inode_seed(inode_bl);
+	ext4_reset_inode_seed(inode);
+	ext4_reset_inode_seed(inode_bl);
 
 	ext4_discard_preallocations(inode);
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a3e2767bdf2f..70551fa91237 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2915,6 +2915,89 @@ void ext4_exit_mballoc(void)
 }
 
 
+void ext4_mb_mark_used(struct super_block *sb, ext4_fsblk_t block,
+		       int len)
+{
+	struct buffer_head *bitmap_bh = NULL;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group;
+	ext4_fsblk_t cluster;
+	ext4_grpblk_t blkoff;
+	int i, clen, err;
+	int already_allocated_count;
+
+	cluster = EXT4_B2C(sbi, block);
+	clen = EXT4_B2C(sbi, len);
+
+	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+	bitmap_bh = ext4_read_block_bitmap(sb, group);
+	if (IS_ERR(bitmap_bh)) {
+		err = PTR_ERR(bitmap_bh);
+		bitmap_bh = NULL;
+		goto out_err;
+	}
+
+	err = -EIO;
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		goto out_err;
+
+	if (!ext4_data_block_valid(sbi, block, len)) {
+		ext4_error(sb, "Allocating blks %llu-%llu which overlap mdata",
+			   cluster, cluster+clen);
+		/* File system mounted not to panic on error
+		 * Fix the bitmap and return EFSCORRUPTED
+		 * We leak some of the blocks here.
+		 */
+		ext4_lock_group(sb, group);
+		ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+		ext4_unlock_group(sb, group);
+		err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+		if (!err)
+			err = -EFSCORRUPTED;
+		goto out_err;
+	}
+
+	ext4_lock_group(sb, group);
+	already_allocated_count = 0;
+	for (i = 0; i < clen; i++)
+		if (mb_test_bit(blkoff + i, bitmap_bh->b_data))
+			already_allocated_count++;
+
+	ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+	if (ext4_has_group_desc_csum(sb) &&
+	    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+		ext4_free_group_clusters_set(sb, gdp,
+					     ext4_free_clusters_after_init(sb,
+						group, gdp));
+	}
+	clen = ext4_free_group_clusters(sb, gdp) - clen +
+	       already_allocated_count;
+	ext4_free_group_clusters_set(sb, gdp, clen);
+	ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+	ext4_group_desc_csum_set(sb, group, gdp);
+
+	ext4_unlock_group(sb, group);
+
+	if (sbi->s_log_groups_per_flex) {
+		ext4_group_t flex_group = ext4_flex_group(sbi, group);
+
+		atomic64_sub(len,
+			     &sbi->s_flex_groups[flex_group].free_clusters);
+	}
+
+	err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+	if (err)
+		goto out_err;
+	err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+
+out_err:
+	brelse(bitmap_bh);
+}
+
 /*
  * Check quota and mark chosen space (ac->ac_b_ex) non-free in bitmaps
  * Returns 0 if success or error code
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..1881710041b6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -215,4 +215,6 @@ ext4_mballoc_query_range(
 	ext4_mballoc_query_range_fn	formatter,
 	void				*priv);
 
+void ext4_mb_mark_used(struct super_block *sb, ext4_fsblk_t block,
+		       int len);
 #endif
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 8b73c5a38d49..0f0b6a64b3b1 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1578,7 +1578,7 @@ static struct buffer_head *__ext4_find_entry(struct inode *dir,
 	return ret;
 }
 
-static struct buffer_head *ext4_find_entry(struct inode *dir,
+struct buffer_head *ext4_find_entry(struct inode *dir,
 					   const struct qstr *d_name,
 					   struct ext4_dir_entry_2 **res_dir,
 					   int *inlined)
@@ -2549,7 +2549,7 @@ static void ext4_dec_count(handle_t *handle, struct inode *inode)
 }
 
 
-static int ext4_add_nondir(handle_t *handle,
+int ext4_add_nondir(handle_t *handle,
 		struct dentry *dentry, struct inode *inode)
 {
 	int err = ext4_add_entry(handle, dentry, inode);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9c24b1c5239f..59329d69d0fc 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2703,6 +2703,28 @@ TRACE_EVENT(ext4_error,
 		  __entry->function, __entry->line)
 );
 
+TRACE_EVENT(ext4_journal_fc_replay_scan,
+	TP_PROTO(struct super_block *sb, int error, int off),
+
+	TP_ARGS(sb, error, off),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, error)
+		__field(int, off)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->error = error;
+		__entry->off = off;
+	),
+
+	TP_printk("FC scan pass on dev %d,%d: error %d, off %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->error, __entry->off)
+);
+
 TRACE_EVENT(ext4_journal_fc_commit_cb_start,
 	TP_PROTO(struct super_block *sb),
 
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 11/13] ext4: add support for asynchronous fast commits
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (9 preceding siblings ...)
  2019-10-01  7:40 ` [PATCH v3 10/13] ext4: fast-commit recovery " Harshad Shirwadkar
@ 2019-10-01  7:41 ` Harshad Shirwadkar
  2019-10-25  6:28   ` Xiaoguang Wang
  2019-10-01  7:41 ` [PATCH v3 12/13] docs: Add fast commit documentation Harshad Shirwadkar
  2019-10-04 19:12 ` [PATCH v3 00/13] ext4: add fast commit support Theodore Y. Ts'o
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:41 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

Until this patch, fast commits could only be invoked by the jbd2 thread.
This patch allows the file system to perform a fast commit asynchronously,
without involving the jbd2 thread. This makes fast commits even faster
because it eliminates the time spent context switching to the jbd2 thread.
To avoid races between the jbd2 thread and async fast commits, we add
new jbd2 APIs that let file systems indicate their intent to perform
an async fast commit.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
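The handshake between async fast commits and the jbd2 thread can be
sketched as a small state model. This is only an illustration, assuming
the semantics visible in the diff below: the struct, the model_* names
and the -EBUSY return are hypothetical (the kernel code sleeps on
j_wait_async_fc instead of returning), and no locking is modeled.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields this patch adds to transaction_t. */
enum txn_state { TXN_RUNNING, TXN_LOCKED };

struct txn_model {
	enum txn_state state;
	unsigned int tid;
	unsigned int subtid;
	bool async_fc_allowed;	/* cleared once a full commit begins */
	bool async_fc_ongoing;	/* set while one async fast commit runs */
};

static void txn_init(struct txn_model *t, unsigned int tid)
{
	t->state = TXN_RUNNING;
	t->tid = tid;
	t->subtid = 0;
	t->async_fc_allowed = true;
	t->async_fc_ongoing = false;
}

/*
 * Non-blocking analogue of jbd2_start_async_fc(). Where the kernel
 * version would sleep until the ongoing fast commit finishes, this
 * model returns -EBUSY (an assumption made to keep it single-threaded).
 */
static int model_start_async_fc(struct txn_model *t, unsigned int tid)
{
	if (t->tid != tid || t->state != TXN_RUNNING)
		return -EINVAL;
	if (!t->async_fc_allowed)
		return -ECANCELED;
	if (t->async_fc_ongoing)
		return -EBUSY;
	t->async_fc_ongoing = true;
	return 0;
}

/* Analogue of jbd2_stop_async_fc(): release the slot, bump the subtid. */
static int model_stop_async_fc(struct txn_model *t, unsigned int tid)
{
	if (t->tid != tid)
		return -EINVAL;
	t->async_fc_ongoing = false;
	t->subtid++;
	return 0;
}

/*
 * Analogue of the hunk in jbd2_journal_commit_transaction(): forbid new
 * async fast commits, then report whether the commit may proceed (true)
 * or must first wait for an in-flight async fast commit (false).
 */
static bool model_begin_full_commit(struct txn_model *t)
{
	t->async_fc_allowed = false;
	return !t->async_fc_ongoing;
}
```

In this model a caller that gets -ECANCELED or -EINVAL would fall back
to waiting for the regular commit, which mirrors what
ext4_fc_async_commit() does when jbd2_start_async_fc() fails.
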
 fs/ext4/ext4.h        |  3 ++
 fs/ext4/ext4_jbd2.c   | 74 +++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/fsync.c       |  7 ++--
 fs/jbd2/commit.c      | 11 +++++++
 fs/jbd2/journal.c     | 59 ++++++++++++++++++++++++++++++++++
 fs/jbd2/transaction.c |  2 ++
 include/linux/jbd2.h  | 10 ++++++
 7 files changed, 164 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index cd5b567d8ca8..a8a481c5ffa4 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2716,6 +2716,9 @@ extern int ext4_group_extend(struct super_block *sb,
 extern int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count);
 
 /* super.c */
+int ext4_fc_async_commit(journal_t *journal, tid_t commit_tid,
+			 tid_t commit_subtid, struct inode *inode,
+			 struct dentry *dentry);
 extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
 					 sector_t block, int op_flags);
 extern int ext4_seq_options_show(struct seq_file *seq, void *offset);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 12d6e70bf676..cf796268322b 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -1144,6 +1144,80 @@ static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
 	return ret;
 }
 
+int ext4_fc_async_commit(journal_t *journal, tid_t commit_tid,
+			 tid_t commit_subtid, struct inode *inode,
+			 struct dentry *dentry)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct super_block *sb = inode->i_sb;
+	struct buffer_head *bh;
+	int ret;
+
+	if (!ext4_should_fast_commit(sb))
+		return jbd2_complete_transaction(journal, commit_tid);
+
+	read_lock(&ei->i_fc.fc_lock);
+	if (ei->i_fc.fc_tid != commit_tid) {
+		read_unlock(&ei->i_fc.fc_lock);
+		return 0;
+	}
+	read_unlock(&ei->i_fc.fc_lock);
+
+	if (ext4_is_inode_fc_ineligible(inode))
+		return jbd2_complete_transaction(journal, commit_tid);
+
+	if (jbd2_commit_check(journal, commit_tid, commit_subtid))
+		return 0;
+
+	ret = jbd2_start_async_fc(journal, commit_tid);
+	if (ret)
+		return jbd2_fc_complete_commit(journal, commit_tid,
+					       commit_subtid);
+
+	trace_ext4_journal_fc_commit_cb_start(sb);
+
+	ret = jbd2_submit_inode_data(journal, ei->jinode);
+	if (ret)
+		goto out;
+
+	ret = jbd2_map_fc_buf(journal, &bh);
+	if (ret) {
+		jbd2_stop_async_fc(journal, commit_tid);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "map_fc_buf");
+		return jbd2_complete_transaction(journal, commit_tid);
+
+	}
+
+	ret = ext4_fc_write_inode(journal, bh, inode, commit_tid,
+				  commit_subtid, 1, dentry);
+
+	if (ret < 0) {
+		brelse(bh);
+		jbd2_stop_async_fc(journal, commit_tid);
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "fc_write_inode");
+		return jbd2_complete_transaction(journal, commit_tid);
+	}
+	lock_buffer(bh);
+	clear_buffer_dirty(bh);
+	set_buffer_uptodate(bh);
+	bh->b_end_io = ext4_end_buffer_io_sync;
+	submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
+
+	jbd2_stop_async_fc(journal, commit_tid);
+	wait_on_buffer(bh);
+	if (unlikely(!buffer_uptodate(bh))) {
+		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "IO");
+		return -EIO;
+	}
+
+out:
+	trace_ext4_journal_fc_commit_cb_stop(sb,
+					     ret < 0 ? 0 : ret,
+					     ret >= 0 ? "success" : "fail");
+	wake_up(&journal->j_wait_async_fc);
+	return ret;
+}
+
 void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
 {
 	if (ext4_should_fast_commit(sb)) {
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..5bbfc55e1756 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -98,7 +98,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
 	int ret = 0, err;
-	tid_t commit_tid;
+	tid_t commit_tid, commit_subtid;
 	bool needs_barrier = false;
 
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
@@ -148,10 +148,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	}
 
 	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+	commit_subtid = datasync ? ei->i_datasync_subtid : ei->i_sync_subtid;
+
 	if (journal->j_flags & JBD2_BARRIER &&
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		needs_barrier = true;
-	ret = jbd2_complete_transaction(journal, commit_tid);
+	ret = ext4_fc_async_commit(journal, commit_tid, commit_subtid,
+				   inode, file->f_path.dentry);
 	if (needs_barrier) {
 	issue_flush:
 		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index e85f51e1cc70..18cb70fa2421 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -452,6 +452,17 @@ void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
 
 	write_lock(&journal->j_state_lock);
 	full_commit = journal->j_do_full_commit;
+	journal->j_running_transaction->t_async_fc_allowed = false;
+	while (journal->j_running_transaction->t_async_fc_ongoing) {
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&journal->j_wait_async_fc, &wait,
+				TASK_UNINTERRUPTIBLE);
+		write_unlock(&journal->j_state_lock);
+		schedule();
+		write_lock(&journal->j_state_lock);
+		finish_wait(&journal->j_wait_async_fc, &wait);
+	}
 	write_unlock(&journal->j_state_lock);
 
 	/* Let file-system try its own fast commit */
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e0684212384d..81daa2cff67f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -794,6 +794,64 @@ int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid)
 	return 0;
 }
 
+int jbd2_start_async_fc(journal_t *journal, tid_t tid)
+{
+	transaction_t *txn;
+	int ret = -EINVAL;
+
+	if (!journal->j_running_transaction)
+		return ret;
+
+	if (journal->j_running_transaction->t_tid != tid)
+		return ret;
+
+	txn = journal->j_running_transaction;
+	write_lock(&journal->j_state_lock);
+	while (txn->t_state == T_RUNNING) {
+		DEFINE_WAIT(wait);
+
+		if (txn->t_async_fc_allowed) {
+			if (!txn->t_async_fc_ongoing) {
+				txn->t_async_fc_ongoing = true;
+				ret = 0;
+				break;
+			}
+			prepare_to_wait(&journal->j_wait_async_fc,
+					&wait, TASK_UNINTERRUPTIBLE);
+			write_unlock(&journal->j_state_lock);
+			schedule();
+			write_lock(&journal->j_state_lock);
+			finish_wait(&journal->j_wait_async_fc, &wait);
+		} else {
+			ret = -ECANCELED;
+			break;
+		}
+	}
+	write_unlock(&journal->j_state_lock);
+
+	return ret;
+}
+
+int jbd2_stop_async_fc(journal_t *journal, tid_t tid)
+{
+	transaction_t *txn;
+
+	if (!journal->j_running_transaction)
+		return -EINVAL;
+
+	if (journal->j_running_transaction->t_tid != tid)
+		return -EINVAL;
+
+	txn = journal->j_running_transaction;
+	write_lock(&journal->j_state_lock);
+	J_ASSERT(txn->t_state == T_RUNNING);
+	txn->t_async_fc_ongoing = false;
+	txn->t_subtid++;
+	write_unlock(&journal->j_state_lock);
+	return 0;
+
+}
+
 /* Return 1 when transaction with given tid has already committed. */
 int jbd2_transaction_committed(journal_t *journal, tid_t tid)
 {
@@ -1308,6 +1366,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
 	init_waitqueue_head(&journal->j_wait_reserved);
+	init_waitqueue_head(&journal->j_wait_async_fc);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
 	spin_lock_init(&journal->j_revoke_lock);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ce7f03cfd90b..f17f813b5610 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -103,6 +103,8 @@ static void jbd2_get_transaction(journal_t *journal,
 	transaction->t_max_wait = 0;
 	transaction->t_start = jiffies;
 	transaction->t_requested = 0;
+	transaction->t_async_fc_allowed = true;
+	transaction->t_async_fc_ongoing = false;
 }
 
 /*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 312103fc9581..5610f16de919 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -604,6 +604,7 @@ struct transaction_s
 		T_FINISHED
 	}			t_state;
 
+	bool t_async_fc_allowed, t_async_fc_ongoing;
 	/*
 	 * Where in the log does this transaction's commit start? [no locking]
 	 */
@@ -869,6 +870,13 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_reserved;
 
+	/**
+	 * @j_wait_async_fc:
+	 *
+	 * Wait queue to wait for completion of async fast commits.
+	 */
+	wait_queue_head_t	j_wait_async_fc;
+
 	/**
 	 * @j_checkpoint_mutex:
 	 *
@@ -1594,6 +1602,8 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid);
 int jbd2_log_do_checkpoint(journal_t *journal);
 int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
 int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid);
+int jbd2_start_async_fc(journal_t *journal, tid_t tid);
+int jbd2_stop_async_fc(journal_t *journal, tid_t tid);
 
 void __jbd2_log_wait_for_space(journal_t *journal);
 extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (10 preceding siblings ...)
  2019-10-01  7:41 ` [PATCH v3 11/13] ext4: add support for asynchronous fast commits Harshad Shirwadkar
@ 2019-10-01  7:41 ` Harshad Shirwadkar
  2019-10-18  1:56   ` Theodore Y. Ts'o
  2019-10-04 19:12 ` [PATCH v3 00/13] ext4: add fast commit support Theodore Y. Ts'o
  12 siblings, 1 reply; 36+ messages in thread
From: Harshad Shirwadkar @ 2019-10-01  7:41 UTC (permalink / raw)
  To: linux-ext4; +Cc: Harshad Shirwadkar

This patch adds the necessary documentation to
Documentation/filesystems/journalling.rst and
Documentation/filesystems/ext4/journal.rst.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
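As a reading aid for the block format documented below, here is a
minimal sketch of the leading header fields and of the three
sub-transaction validity rules. It is not the code from this series:
struct fc_hdr_model and fc_valid_prefix() are hypothetical names, the
values are assumed to be already converted to host byte order, and the
fields after __fc_len are left out because the table does not describe
the bytes between 0x1c and 0x2c.

```c
#include <stddef.h>
#include <stdint.h>

#define FC_MAGIC	0xE2540090u	/* fc_magic value from the table */
#define FC_FLAG_LAST	0x1u		/* "LAST" flag in fc_flags */

/* Leading fields of a fast commit block, at the documented offsets. */
struct fc_hdr_model {
	uint32_t h_magic;	/* 0x00: common block header */
	uint32_t h_blocktype;	/* 0x04 */
	uint32_t h_sequence;	/* 0x08 */
	uint32_t fc_magic;	/* 0x0C */
	uint32_t fc_subtid;	/* 0x10 */
	uint8_t  fc_features;	/* 0x14 */
	uint8_t  fc_flags;	/* 0x15 */
	uint16_t fc_num_tlvs;	/* 0x16 */
	uint32_t fc_len;	/* 0x18: __fc_len in the table */
};

/*
 * Return how many leading blocks belong to complete, valid
 * sub-transactions. Scanning stops at the first block that has a bad
 * magic, a SUBTID lower than the previous block, or a new SUBTID while
 * the previous sub-transaction has not been closed by a LAST block.
 */
size_t fc_valid_prefix(const struct fc_hdr_model *blk, size_t n)
{
	size_t valid_end = 0;
	uint32_t prev_subtid = 0;
	int open = 0;	/* inside a sub-transaction not yet closed */

	for (size_t i = 0; i < n; i++) {
		if (blk[i].fc_magic != FC_MAGIC)
			break;
		if (i > 0 && blk[i].fc_subtid < prev_subtid)
			break;
		if (open && blk[i].fc_subtid != prev_subtid)
			break;
		prev_subtid = blk[i].fc_subtid;
		open = !(blk[i].fc_flags & FC_FLAG_LAST);
		if (!open)
			valid_end = i + 1;
	}
	return valid_end;
}
```

Blocks past the returned index would simply be ignored at replay time
under this interpretation; a trailing sub-transaction without a LAST
block is treated as incomplete.
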
 Documentation/filesystems/ext4/journal.rst | 98 ++++++++++++++++++++--
 Documentation/filesystems/journalling.rst  | 22 +++++
 2 files changed, 114 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..23e7db89fc6a 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,14 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.
 
 The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is a normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk. Ext4 also utilizes JBD2's fast
+commits. Fast commits store metadata changes to inodes in an
+incremental fashion. A fast commit is valid only if no full commit
+follows that particular fast commit. Because of this, fast commit
+blocks are overwritten by a following transaction.
 
 All fields in jbd2 are written to disk in big-endian order. This is the
 opposite of ext4.
@@ -48,16 +52,18 @@ Layout
 Generally speaking, the journal has this format:
 
 .. list-table::
-   :widths: 16 48 16
+   :widths: 16 48 16 18
    :header-rows: 1
 
    * - Superblock
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      - One transaction
      -
+     -
 
 Notice that a transaction begins with either a descriptor and some data,
 or a block revocation list. A finished transaction always ends with a
@@ -76,7 +82,7 @@ The journal superblock will be in the next full block after the
 superblock.
 
 .. list-table::
-   :widths: 12 12 12 32 12
+   :widths: 12 12 12 32 12 12
    :header-rows: 1
 
    * - 1024 bytes of padding
@@ -85,11 +91,13 @@ superblock.
      - descriptor\_block (data\_blocks or revocation\_block) [more data or
        revocations] commmit\_block
      - [more transactions...]
+     - [Fast commits...]
    * - 
      -
      -
      - One transaction
      -
+     -
 
 Block Header
 ~~~~~~~~~~~~
@@ -609,3 +617,81 @@ bytes long (but uses a full block):
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
 
+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with a
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+Multiple fast commit blocks can be part of one sub-transaction. To
+indicate the last block in a sub-transaction, the fc\_flags field
+in the last block of every sub-transaction is marked with the "LAST"
+(0x1) flag. A sub-transaction is valid only if all the following
+conditions are met:
+
+1) The SUBTID of every block is equal to or greater than the SUBTID
+   of the previous fast commit block.
+2) For every sub-transaction, the last block is marked with the LAST flag.
+3) There are no invalid blocks in between.
+
+.. list-table::
+   :widths: 8 8 24 40
+   :header-rows: 1
+
+   * - Offset
+     - Type
+     - Name
+     - Descriptor
+   * - 0x0
+     - journal\_header\_s
+     - (open coded)
+     - Common block header.
+   * - 0xC
+     - \_\_le32
+     - fc\_magic
+     - Magic value which should be set to 0xE2540090. This identifies
+       that this block is a fast commit block.
+   * - 0x10
+     - \_\_le32
+     - fc\_subtid
+     - Sub-transaction ID for this commit block
+   * - 0x14
+     - \_\_u8
+     - fc\_features
+     - Features used by this fast commit block.
+   * - 0x15
+     - \_\_u8
+     - fc\_flags
+     - Flags. 0x1 (LAST): this is the last block in the sub-transaction.
+   * - 0x16
+     - \_\_le16
+     - fc\_num\_tlvs
+     - Number of TLVs contained in this fast commit block
+   * - 0x18
+     - \_\_le32
+     - \_\_fc\_len
+     - Length of the fast commit block in terms of number of blocks
+   * - 0x2c
+     - \_\_le32
+     - fc\_ino
+     - Inode number of the inode that will be recovered using this fast commit
+   * - 0x30
+     - struct ext4\_inode
+     - inode
+     - On-disk copy of the inode at the commit time
+   * - 0x34
+     - struct ext4\_fc\_tl
+     - Array of struct ext4\_fc\_tl
+     - The actual delta with the last commit. Starting at this offset,
+       there is an array of TLVs that indicate which extents
+       should be present in the corresponding inode. Currently, the
+       following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
+       should be present in the inode), EXT4\_FC\_TAG\_DNAME (dentry
+       name of the inode), EXT4\_FC\_TAG\_PARENT\_INO (inode number of
+       the directory that should contain the dentry of the inode).
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..217f66d67f9d 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,28 @@ called after each transaction commit. You can also use
 ``transaction->t_private_list`` for attaching entries to a transaction
 that need processing when the transaction commits.
 
+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. File systems that wish
+to use this feature should first set
+``journal->j_fc_commit_callback``. That function is called before
+performing a commit. The file system can call :c:func:`jbd2_map_fc_buf()`
+to get buffers reserved for fast commits. If the callback returns 0,
+JBD2 assumes that the file system performed a fast commit and it backs off
+from performing a commit. Otherwise, JBD2 falls back to a normal full
+commit. After performing either a fast or a full commit, JBD2 calls
+``journal->j_fc_cleanup_cb`` to allow file systems to clean up
+their internal fast commit related data structures. At replay
+time, JBD2 passes every fast commit block to the file system
+via ``journal->j_fc_replay_cb``. Ext4 uses this fast
+commit mechanism to improve journal commit performance.
+
+File systems can also perform fast commits asynchronously (without
+involving the journalling thread). To do so, a file system
+calls :c:func:`jbd2_start_async_fc()` before starting the commit
+and :c:func:`jbd2_stop_async_fc()` after the commit. This ensures
+that the journalling thread and other async fast committers do
+not interfere with each other.
+
 JBD2 also provides a way to block all transaction updates via
 :c:func:`jbd2_journal_lock_updates()` /
 :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
-- 
2.23.0.444.g18eeb5a265-goog


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/13] ext4: add fast commit support
  2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
                   ` (11 preceding siblings ...)
  2019-10-01  7:41 ` [PATCH v3 12/13] docs: Add fast commit documentation Harshad Shirwadkar
@ 2019-10-04 19:12 ` Theodore Y. Ts'o
  2019-10-04 20:11   ` harshad shirwadkar
  12 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-04 19:12 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:49AM -0700, Harshad Shirwadkar wrote:
> 
> Testing
> -------
> 
> e2fsprogs was updated to set fast commit feature flag and to ignore
> fast commit blocks during e2fsck.
> 
> https://github.com/harshadjs/e2fsprogs.git
> 
> After applying all the patches in this series, following runs of
> xfstests were performed:
> 
> - kvm-xfstest.sh -g log -c 4k
> - kvm-xfstests.sh smoke
> 
> All the log tests were successful and smoke tests didn't introduce any
> additional failures.

You should probably also try running the shutdown tests, and
eventually, run all of the auto group.  I've added a fast_commit group
to {kvm,gce}-xfstests, although to use it you need a modified e2fsprogs
which understands the fast_commit feature.  I can make kvm-xfstests and
gce-xfstests images using an e2fsprogs package from debian/experimental
which has fast_commit enabled.

When I tried running all of the auto group tests, the following
failure was found in generic/047 (which is a shutdown group test).

						- Ted

BEGIN TEST fast_commit (1 test): Ext4 4k block w/fast_commit Fri Oct  4 13:44:45 EDT 2019
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -I 256 -O fast_commit,64bit
EXT_MOUNT_OPTIONS: -o block_validity
FSTYP         -- ext4
PLATFORM      -- Linux/x86_64 kvm-xfstests 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202 SMP Thu Oct 3 17:27:50 EDT 2019
MKFS_OPTIONS  -- -q -I 256 -O fast_commit,64bit /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

generic/047		[13:44:46][   24.671344] run fstests generic/047 at 2019-10-04 13:44:46
[   24.951140] EXT4-fs (vdc): shut down requested (1)
[   24.952280] Aborting journal on device vdc-8.
[   28.012724] EXT4-fs (vdc): shut down requested (2)
[   28.013639] Aborting journal on device vdc-8.
[   28.014486] 
[   28.014845] ============================================
[   28.015996] WARNING: possible recursive locking detected
[   28.017072] 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202 Not tainted
[   28.018374] --------------------------------------------
[   28.019693] jbd2/vdc-8/1476 is trying to acquire lock:
[   28.020635] 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_cleanup_cb+0x2f/0xa0
[   28.022387] 
[   28.022387] but task is already holding lock:
[   28.023414] 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_commit_cb+0x83/0xa90
[   28.025237] 
[   28.025237] other info that might help us debug this:
[   28.026350]  Possible unsafe locking scenario:
[   28.026350] 
[   28.027336]        CPU0
[   28.027758]        ----
[   28.028240]   lock(&(&sbi->s_fc_lock)->rlock);
[   28.029105]   lock(&(&sbi->s_fc_lock)->rlock);
[   28.029937] 
[   28.029937]  *** DEADLOCK ***
[   28.029937] 
[   28.031154]  May be due to missing lock nesting notation
[   28.031154] 
[   28.032780] 1 lock held by jbd2/vdc-8/1476:
[   28.033760]  #0: 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_commit_cb+0x83/0xa90
[   28.035436] 
[   28.035436] stack backtrace:
[   28.036197] CPU: 1 PID: 1476 Comm: jbd2/vdc-8 Not tainted 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202
[   28.037868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[   28.039289] Call Trace:
[   28.039772]  dump_stack+0x67/0x90
[   28.040427]  validate_chain.cold+0x1be/0x21b
[   28.041305]  __lock_acquire+0x447/0x7c0
[   28.042069]  lock_acquire+0x9a/0x180
[   28.042738]  ? ext4_journal_fc_cleanup_cb+0x2f/0xa0
[   28.043663]  _raw_spin_lock+0x31/0x80
[   28.044346]  ? ext4_journal_fc_cleanup_cb+0x2f/0xa0
[   28.045264]  ext4_journal_fc_cleanup_cb+0x2f/0xa0
[   28.046154]  jbd2_journal_commit_transaction+0x243/0x24bb
[   28.047156]  ? sched_clock_cpu+0xc/0xc0
[   28.048099]  ? lock_timer_base+0x10/0x80
[   28.048935]  ? kvm_sched_clock_read+0x14/0x30
[   28.050022]  ? sched_clock+0x5/0x10
[   28.050853]  ? sched_clock_cpu+0xc/0xc0
[   28.051793]  ? kjournald2+0x143/0x3f0
[   28.052606]  kjournald2+0x143/0x3f0
[   28.053311]  ? __wake_up_common_lock+0xc0/0xc0
[   28.054935]  kthread+0x108/0x140
[   28.055975]  ? __jbd2_debug+0x50/0x50
[   28.057105]  ? __kthread_create_on_node+0x1a0/0x1a0
[   28.058346]  ret_from_fork+0x3a/0x50

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 00/13] ext4: add fast commit support
  2019-10-04 19:12 ` [PATCH v3 00/13] ext4: add fast commit support Theodore Y. Ts'o
@ 2019-10-04 20:11   ` harshad shirwadkar
  0 siblings, 0 replies; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-04 20:11 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

Thanks for that. I fixed this deadlock, and I'll run all the tests that
you mentioned.
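
For readers following the trace: lockdep shows s_fc_lock still held
from ext4_journal_fc_commit_cb() when ext4_journal_fc_cleanup_cb(),
called directly from jbd2_journal_commit_transaction(), tries to take
it again. That pattern is what a lock leaked on an early return looks
like. The sketch below models only that class of bug; the names, the
"journal aborted" trigger and the repair shown are assumptions for
illustration, not the actual code or the actual fix.

```c
#include <stdbool.h>

/* Hypothetical non-recursive lock standing in for sbi->s_fc_lock. */
struct fc_lock_model {
	bool held;
};

/* Returns false where a real spinlock would spin forever. */
static bool fc_trylock(struct fc_lock_model *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void fc_unlock(struct fc_lock_model *l)
{
	l->held = false;
}

/* Buggy shape: the early return on an aborted journal skips the unlock. */
static int commit_cb_leaky(struct fc_lock_model *l, bool journal_aborted)
{
	fc_trylock(l);
	if (journal_aborted)
		return -1;	/* lock is still held here */
	fc_unlock(l);
	return 0;
}

/* One possible repair: every path releases the lock before returning. */
static int commit_cb_balanced(struct fc_lock_model *l, bool journal_aborted)
{
	int ret = journal_aborted ? -1 : 0;

	fc_trylock(l);
	/* ... fast commit work would go here when ret == 0 ... */
	fc_unlock(l);
	return ret;
}

/* The cleanup callback takes the same lock; false means it would deadlock. */
static bool cleanup_cb(struct fc_lock_model *l)
{
	if (!fc_trylock(l))
		return false;
	fc_unlock(l);
	return true;
}
```

With the leaky variant, the cleanup callback cannot take the lock after
a failed commit; with the balanced variant it can.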

On Fri, Oct 4, 2019 at 12:12 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Oct 01, 2019 at 12:40:49AM -0700, Harshad Shirwadkar wrote:
> >
> > Testing
> > -------
> >
> > e2fsprogs was updated to set fast commit feature flag and to ignore
> > fast commit blocks during e2fsck.
> >
> > https://github.com/harshadjs/e2fsprogs.git
> >
> > After applying all the patches in this series, following runs of
> > xfstests were performed:
> >
> > - kvm-xfstest.sh -g log -c 4k
> > - kvm-xfstests.sh smoke
> >
> > All the log tests were successful and smoke tests didn't introduce any
> > additional failures.
>
> You should probably also try running the shutdown tests, and
> eventually, run all of the auto group.  I've added a fast_commit group
> to {kvm,gce}-xfstests, although to use it a modified e2fsprogs which
> understands the fast_commit feature.  I can make kvm-xfstests and
> gce-xfstests image using an e2fsprogs package from debian/experimental
> which has fast_commit enabled.
>
> When I tried running all of the auto group tests, the following
> failure was found in generic/047 (which is a shutdown group test).
>
>                                                 - Ted
>
> BEGIN TEST fast_commit (1 test): Ext4 4k block w/fast_commit Fri Oct  4 13:44:45 EDT 2019
> DEVICE: /dev/vdd
> EXT_MKFS_OPTIONS: -I 256 -O fast_commit,64bit
> EXT_MOUNT_OPTIONS: -o block_validity
> FSTYP         -- ext4
> PLATFORM      -- Linux/x86_64 kvm-xfstests 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202 SMP Thu Oct 3 17:27:50 EDT 2019
> MKFS_OPTIONS  -- -q -I 256 -O fast_commit,64bit /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
>
> generic/047             [13:44:46][   24.671344] run fstests generic/047 at 2019-10-04 13:44:46
> [   24.951140] EXT4-fs (vdc): shut down requested (1)
> [   24.952280] Aborting journal on device vdc-8.
> [   28.012724] EXT4-fs (vdc): shut down requested (2)
> [   28.013639] Aborting journal on device vdc-8.
> [   28.014486]
> [   28.014845] ============================================
> [   28.015996] WARNING: possible recursive locking detected
> [   28.017072] 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202 Not tainted
> [   28.018374] --------------------------------------------
> [   28.019693] jbd2/vdc-8/1476 is trying to acquire lock:
> [   28.020635] 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_cleanup_cb+0x2f/0xa0
> [   28.022387]
> [   28.022387] but task is already holding lock:
> [   28.023414] 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_commit_cb+0x83/0xa90
> [   28.025237]
> [   28.025237] other info that might help us debug this:
> [   28.026350]  Possible unsafe locking scenario:
> [   28.026350]
> [   28.027336]        CPU0
> [   28.027758]        ----
> [   28.028240]   lock(&(&sbi->s_fc_lock)->rlock);
> [   28.029105]   lock(&(&sbi->s_fc_lock)->rlock);
> [   28.029937]
> [   28.029937]  *** DEADLOCK ***
> [   28.029937]
> [   28.031154]  May be due to missing lock nesting notation
> [   28.031154]
> [   28.032780] 1 lock held by jbd2/vdc-8/1476:
> [   28.033760]  #0: 000000005ce13aef (&(&sbi->s_fc_lock)->rlock){+.+.}, at: ext4_journal_fc_commit_cb+0x83/0xa90
> [   28.035436]
> [   28.035436] stack backtrace:
> [   28.036197] CPU: 1 PID: 1476 Comm: jbd2/vdc-8 Not tainted 5.3.0-rc4-xfstests-00012-gedca88337ca9 #1202
> [   28.037868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> [   28.039289] Call Trace:
> [   28.039772]  dump_stack+0x67/0x90
> [   28.040427]  validate_chain.cold+0x1be/0x21b
> [   28.041305]  __lock_acquire+0x447/0x7c0
> [   28.042069]  lock_acquire+0x9a/0x180
> [   28.042738]  ? ext4_journal_fc_cleanup_cb+0x2f/0xa0
> [   28.043663]  _raw_spin_lock+0x31/0x80
> [   28.044346]  ? ext4_journal_fc_cleanup_cb+0x2f/0xa0
> [   28.045264]  ext4_journal_fc_cleanup_cb+0x2f/0xa0
> [   28.046154]  jbd2_journal_commit_transaction+0x243/0x24bb
> [   28.047156]  ? sched_clock_cpu+0xc/0xc0
> [   28.048099]  ? lock_timer_base+0x10/0x80
> [   28.048935]  ? kvm_sched_clock_read+0x14/0x30
> [   28.050022]  ? sched_clock+0x5/0x10
> [   28.050853]  ? sched_clock_cpu+0xc/0xc0
> [   28.051793]  ? kjournald2+0x143/0x3f0
> [   28.052606]  kjournald2+0x143/0x3f0
> [   28.053311]  ? __wake_up_common_lock+0xc0/0xc0
> [   28.054935]  kthread+0x108/0x140
> [   28.055975]  ? __jbd2_debug+0x50/0x50
> [   28.057105]  ? __kthread_create_on_node+0x1a0/0x1a0
> [   28.058346]  ret_from_fork+0x3a/0x50

* Re: [PATCH v3 01/13] ext4: add handling for extended mount options
  2019-10-01  7:40 ` [PATCH v3 01/13] ext4: add handling for extended mount options Harshad Shirwadkar
@ 2019-10-16  2:14   ` Theodore Y. Ts'o
  2019-10-21 20:41     ` harshad shirwadkar
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16  2:14 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, Andreas Dilger

On Tue, Oct 01, 2019 at 12:40:50AM -0700, Harshad Shirwadkar wrote:
> @@ -1858,8 +1863,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  			set_opt2(sb, EXPLICIT_DELALLOC);
>  		} else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
>  			set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
> -		} else
> +		} else if (m->mount_opt) {
>  			return -1;
> +		}
>  	}
>  	if (m->flags & MOPT_CLEAR_ERR)
>  		clear_opt(sb, ERRORS_MASK);

Why is this change needed?  This is in the handling of options that
have MOPT_EXPLICIT, and it doesn't seem relevant to this commit?

     		    	   	   		 - Ted

* Re: [PATCH v3 02/13] jbd2: fast commit setup and enable
  2019-10-01  7:40 ` [PATCH v3 02/13] jbd2: fast commit setup and enable Harshad Shirwadkar
@ 2019-10-16 13:03   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 13:03 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:51AM -0700, Harshad Shirwadkar wrote:
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 953990eb70a9..7c13834873ad 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1159,12 +1159,15 @@ static journal_t *journal_init_common(struct block_device *bdev,
>  	journal->j_blk_offset = start;
>  	journal->j_maxlen = len;
>  	n = journal->j_blocksize / sizeof(journal_block_tag_t);
> -	journal->j_wbufsize = n;
> +	journal->j_wbufsize = n - JBD2_FAST_COMMIT_BLOCKS;
>  	journal->j_wbuf = kmalloc_array(n, sizeof(struct buffer_head *),
>  					GFP_KERNEL);
>  	if (!journal->j_wbuf)
>  		goto err_cleanup;
>  
> +	journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
> +	journal->j_fc_wbufsize = JBD2_FAST_COMMIT_BLOCKS;
> +
>  	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
>  	if (!bh) {
>  		pr_err("%s: Cannot get buffer for journal superblock\n",

This is being done unconditionally, regardless of whether or not fast
commit has been enabled.  As a result, for the non-fc case, j_wbufsize
is going to be unconditionally reduced in size, which would be
unfortunate.

I suggest what you do is create a new function, called
jbd2_init_fast_commit() which is called from ext4_init_fast_commit(),
added in a later patch, and which takes as an argument the size of the
fast_commit region (e.g., what is currently the constant
JBD2_FAST_COMMIT_BLOCKS), since this should be under the control of
the file system.

We can then pull these changes out of journal_init_common(), and move
them into jbd2_init_fast_commit().

> -/**
> - * int jbd2_journal_load() - Read journal from disk.
> - * @journal: Journal to act on.
> - *
> - * Given a journal_t structure which tells us which disk blocks contain
> - * a journal, read the journal from disk to initialise the in-memory
> - * structures.
> - */
> -int jbd2_journal_load(journal_t *journal)
> +static int __jbd2_journal_load(journal_t *journal, bool enable_fc)
>  {
>  	int err;
>  	journal_superblock_t *sb;

Instead of adding __jbd2_journal_load() with its enable_fc flag, we
can just test based on journal->j_fc_wbufsize being non-zero.  That
will have been set by jbd2_init_fast_commit(), which is called before
jbd2_journal_load().

As a result, we won't need to add __jbd2_journal_load() and the
jbd2_load_with_fc() functions.
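
To make the two suggestions above concrete, here is a rough userspace
sketch (not kernel code).  The struct and the *_model function names are
made up for illustration; only the field names j_wbuf, j_wbufsize,
j_fc_wbuf and j_fc_wbufsize come from the patch.  It assumes the file
system calls the fast commit init helper before loading the journal, so
the load path can test j_fc_wbufsize instead of taking an enable_fc
argument.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the kernel's journal_t. */
struct journal_model {
	void **j_wbuf;      /* write buffer array */
	int j_wbufsize;     /* slots usable by regular commits */
	void **j_fc_wbuf;   /* fast commit slots, NULL when disabled */
	int j_fc_wbufsize;  /* 0 means fast commit was never set up */
};

/* Common init: every slot belongs to the regular commit path. */
static int journal_init_common_model(struct journal_model *j, int nslots)
{
	j->j_wbuf = calloc((size_t)nslots, sizeof(*j->j_wbuf));
	if (!j->j_wbuf)
		return -1;
	j->j_wbufsize = nslots;
	j->j_fc_wbuf = NULL;
	j->j_fc_wbufsize = 0;
	return 0;
}

/*
 * Called by the file system only when it wants fast commits.  The size
 * of the fast commit region is an argument, so it stays under the
 * control of the file system.
 */
static int jbd2_init_fast_commit_model(struct journal_model *j, int num_fc_blks)
{
	if (num_fc_blks <= 0 || num_fc_blks >= j->j_wbufsize)
		return -1;
	j->j_wbufsize -= num_fc_blks;
	j->j_fc_wbuf = &j->j_wbuf[j->j_wbufsize];
	j->j_fc_wbufsize = num_fc_blks;
	return 0;
}

/* Load-time test that replaces the enable_fc flag. */
static int journal_wants_fast_commit(const struct journal_model *j)
{
	return j->j_fc_wbufsize != 0;
}
```

With this shape, a journal that never calls the fast commit init helper
keeps its full j_wbufsize.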

> @@ -1684,6 +1694,12 @@ int jbd2_journal_load(journal_t *journal)
>  		return -EFSCORRUPTED;
>  	}
>  
> +	if (enable_fc)
> +		jbd2_journal_set_features(journal, 0, 0,
> +					  JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> +	else
> +		jbd2_journal_clear_features(journal, 0, 0,
> +					    JBD2_FEATURE_INCOMPAT_FAST_COMMIT);

We don't actually need to clear the feature, since it gets cleared
after the journal is successfully replayed.

> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index b7eed49b8ecd..84d04e1f3d92 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -918,6 +919,30 @@ struct journal_s
>  	 */
>  	unsigned long		j_last;
>  
> +	/**
> +	 * @j_first_fc:
> +	 *
> +	 * The block number of the first fast commit block in the journal
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_first_fc;

Is this really protected by j_state_lock?  It's set up at journal load
time, and then never changed.  As a result, it's safe to read
j_first_fc without first taking the j_state_lock.

> +
> +	/**
> +	 * @j_fc_off:
> +	 *
> +	 * Number of fast commit blocks currently allocated.
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_fc_off;

I'll mention this later, but we're not *actually* taking j_state_lock
when accessing j_fc_off.  In particular, jbd2_map_fc_buf() and its
caller (ext4_journal_fc_commit_cb) aren't taking j_state_lock.

I haven't had a chance to trace the locking hierarchy to figure out
whether the documentation or the locking is wrong, but my initial
read is that the locking might be wrong?

> +
> +	/**
> +	 * @j_last_fc:
> +	 *
> +	 * The block number one beyond the last fast commit block in the journal
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_last_fc;
> +

Again, this should never change once the journal structure is set up,
so it doesn't need to be protected by j_state_lock.

						- Ted

* Re: [PATCH v3 03/13] jbd2: fast-commit commit path changes
  2019-10-01  7:40 ` [PATCH v3 03/13] jbd2: fast-commit commit path changes Harshad Shirwadkar
@ 2019-10-16 16:38   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 16:38 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:52AM -0700, Harshad Shirwadkar wrote:
> This patch adds core fast-commit commit path changes. This patch also
> modifies existing JBD2 APIs to allow usage of fast commits. If fast
> commits are enabled and journal->j_do_full_commit is not set,

This flag should really be a property of the transaction, not the
journal.  Otherwise it might not be clear what transaction is meant in
jbd2_log_start_commit():

> @@ -522,11 +539,23 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid)
>  	int ret;
>  
>  	write_lock(&journal->j_state_lock);
> +	journal->j_do_full_commit = true;
>  	ret = __jbd2_log_start_commit(journal, tid);
>  	write_unlock(&journal->j_state_lock);
>  	return ret;
>  }

Does tid refer to the transaction which is just starting to be
committed?  Or the next transaction?

If we make the flag be attached to the transaction, then it's very
clear which transaction must be a full commit, and I think it will
simplify things.
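
As a rough illustration of attaching the flag to the transaction, here
is a userspace sketch.  The struct layout, the t_need_full_commit field
name and request_full_commit() are all hypothetical; the point is only
that the tid passed in selects exactly one transaction, so there is no
ambiguity about which commit must be a full one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int tid_t;

/* Hypothetical, simplified stand-in for transaction_t. */
struct transaction_model {
	tid_t t_tid;
	bool t_need_full_commit;  /* made-up field name */
};

/* Hypothetical, simplified stand-in for journal_t. */
struct journal_model {
	struct transaction_model *j_running_transaction;
	struct transaction_model *j_committing_transaction;
};

/* Find the live transaction with the given tid, or NULL. */
static struct transaction_model *find_txn(struct journal_model *j, tid_t tid)
{
	if (j->j_running_transaction &&
	    j->j_running_transaction->t_tid == tid)
		return j->j_running_transaction;
	if (j->j_committing_transaction &&
	    j->j_committing_transaction->t_tid == tid)
		return j->j_committing_transaction;
	return NULL;
}

/*
 * Mark only the transaction named by tid as needing a full commit.
 * Returns 1 if the flag was set, 0 if no live transaction has that tid.
 */
static int request_full_commit(struct journal_model *j, tid_t tid)
{
	struct transaction_model *t = find_txn(j, tid);

	if (!t)
		return 0;
	t->t_need_full_commit = true;
	return 1;
}
```

In the real code the lookup would of course have to happen under
j_state_lock; the sketch leaves locking out.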

> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 132fb92098c7..7db3e2b6336d 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> + * fc is input / output parameter. If fc is non-null and is set to true, this
> + * function tries to perform fast commit. If the fast commit is successfully
> + * performed, *fc is set to true.
>   */
> -void jbd2_journal_commit_transaction(journal_t *journal)
> +void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)

I think it's going to make things much simpler if we pull out the code
which does fast commits, which was added to this function, into its
own function, jbd2_fast_commit_transaction().

Right now the logic regarding whether to do a fast commit or a full
commit is split between kjournald2() and
jbd2_journal_commit_transaction() --- and once we implement asynchronous
fast commits, it doesn't make sense to even have some of this logic.

I think it will be easier if you modify the commits to add support for
asynchronous commits from the beginning.  In that world, we don't need
to have the fast commit logic inside jbd2_journal_commit_transaction(),
and that means we don't have to add the fc variable.

It also avoids a minor inconsistency in the current code, where in
order to have kjournald2() actually call
jbd2_journal_commit_transaction(), we have to bump the
j_commit_request indicating that we want to commit the current
transaction.  But then if we can do the fast commit, j_commit_request
is left indicating that there is an outstanding request that the
existing transaction be committed --- but we don't start committing
it.

That's going to be confusing for future debugging, and I could imagine
existing or future code thinking that there has already been a
request to start committing the current transaction, so it doesn't try
waking up the kjournald2 thread.

> @@ -160,7 +160,13 @@ static void commit_timeout(struct timer_list *t)
>   *
>   * 1) COMMIT:  Every so often we need to commit the current state of the
>   *    filesystem to disk.  The journal thread is responsible for writing
> - *    all of the metadata buffers to disk.
> + *    all of the metadata buffers to disk. If fast commits are allowed,
> + *    journal thread passes the control to the file system and file system
> + *    is then responsible for writing metadata buffers to disk (in whichever
> + *    format it wants). If fast commit succeds, journal thread won't perform
> + *    a normal commit. In case the fast commit fails, journal thread performs
> + *    full commit as normal.

Note: this comment needs to be updated once we are doing async fc
commits.

> @@ -702,12 +745,27 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  	}
>  #endif
>  	while (tid_gt(tid, journal->j_commit_sequence)) {
> -		jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
> -				  tid, journal->j_commit_sequence);
> +		if ((!journal->j_do_full_commit) &&
> +		    !tid_gt(subtid, journal->j_fc_sequence))
> +			break;
> +		jbd_debug(1, "JBD2: want full commit %u %s %u, ",
> +			  tid, journal->j_do_full_commit ?
> +			  "and ignoring fast commit request for " :
> +			  "or want fast commit",
> +			  journal->j_fc_sequence);
> +		jbd_debug(1, "j_commit_sequence=%u, j_fc_sequence=%u\n",
> +			  journal->j_commit_sequence,
> +			  journal->j_fc_sequence);
>  		read_unlock(&journal->j_state_lock);
>  		wake_up(&journal->j_wait_commit);
> -		wait_event(journal->j_wait_done_commit,
> -				!tid_gt(tid, journal->j_commit_sequence));
> +		if (journal->j_do_full_commit)
> +			wait_event(journal->j_wait_done_commit,
> +				   !tid_gt(tid, journal->j_commit_sequence));
> +		else
> +			wait_event(journal->j_wait_done_commit,
> +				   !tid_gt(tid, journal->j_commit_sequence) ||
> +				   !tid_gt(subtid,
> +					    journal->j_fc_sequence));
>  		read_lock(&journal->j_state_lock);
>  	}
>  	read_unlock(&journal->j_state_lock);

This change is also not needed with async fast commits, right?

          	       	    	       - Ted

* Re: [PATCH v3 04/13] jbd2: fast-commit commit path new APIs
  2019-10-01  7:40 ` [PATCH v3 04/13] jbd2: fast-commit commit path new APIs Harshad Shirwadkar
@ 2019-10-16 17:20   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 17:20 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:53AM -0700, Harshad Shirwadkar wrote:
> This patch adds new helper APIs that ext4 needs for fast
> commits. These new fast commit APIs are used by subsequent fast commit
> patches to implement fast commits. Following new APIs are added:
> 
> /*
>  * Returns when either a full commit or a fast commit
>  * completes
>  */
> int jbd2_fc_complete_commit(journal_tc *journal, tid_t tid,
> 			                tid_t subtid)
> 
> /* Send all the data buffers related to an inode */
> int journal_submit_inode_data(journal_t *journal,
> 			                  struct jbd2_inode *jinode)
> 
> /* Map one fast commit buffer for use by the file system */
> int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
> 
> /* Wait on fast commit buffers to complete IO */
> jbd2_wait_on_fc_bufs(journal_t *journal, int num_bufs)
> 
> /*
>  * Returns 1 if transaction identified by tid:subtid is already
>  * committed.
>  */
> int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid)

Please move these comments into the code, before each function.  This
documentation is going to be useful long after the patch gets merged,
and people will be looking for them in the source code, and not
necessarily in the commit description.

> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 7db3e2b6336d..e85f51e1cc70 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -202,6 +202,38 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
>  	return ret;
>  }
>  
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)

This code was pulled out of journal_submit_data_buffers(), but given
how it was called, there were locking assumptions that were broken as
a result.

> +{
> +	struct address_space *mapping;
> +	loff_t dirty_start = jinode->i_dirty_start;
> +	loff_t dirty_end = jinode->i_dirty_end;
> +	int ret;
> +
> +	if (!jinode)
> +		return 0;
> +
> +	if (!(jinode->i_flags & JI_WRITE_DATA))
> +		return 0;

Originally in journal_submit_data_buffers() we were holding onto
j_list_lock, and that's needed to safely reference jinode->i_flags.

> +
> +	dirty_start = jinode->i_dirty_start;
> +	dirty_end = jinode->i_dirty_end;
> +
> +	mapping = jinode->i_vfs_inode->i_mapping;
> +	jinode->i_flags |= JI_COMMIT_RUNNING;

Originally there was a spin_unlock(&journal->j_list_lock) here.  And
that's important since there was a memory barrier there which we
needed in order to make sure other CPUs would see the
JI_COMMIT_RUNNING flag.

It's not clear we need to worry about this, if this is only going to
be used in the async fast commit context.  This is another example of
how trying to do the fast commit in the userspace (or nfs server's)
process context is much simpler, since the JI_COMMIT_RUNNING flag
is needed to make sure there isn't a race with the inode getting
evicted and jbd2_journal_release_jbd_inode.

And if we're calling this function from ext4_jbd2.c, where the inode's
ref count is elevated and there is no risk of the inode getting
evicted from memory, then this particular race is not a problem, and
so messing with JI_COMMIT_RUNNING and the call to wake_up_bit is all
not necessary.

By the way, this function only submits the data to be written out.  It
does not wait for the writeout to be completed.  For that, you need
the equivalent of journal_finish_inode_data_buffers(), and I don't see
that equivalent functionality in the fast commit code path?

     			      	     - Ted

* Re: [PATCH v3 05/13] jbd2: fast-commit recovery path changes
  2019-10-01  7:40 ` [PATCH v3 05/13] jbd2: fast-commit recovery path changes Harshad Shirwadkar
@ 2019-10-16 17:30   ` Theodore Y. Ts'o
  2019-10-22  0:51     ` harshad shirwadkar
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 17:30 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:54AM -0700, Harshad Shirwadkar wrote:
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 14d549445418..e0684212384d 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
>  
>  	jbd2_write_superblock(journal, write_op);
>  
> +	if (had_fast_commit)
> +		jbd2_set_feature_fast_commit(journal);
> +

Why the logic with had_fast_commit and (re-)setting the fast commit
feature flag?

This ties back to how we handle the logic around setting the fast
commit flag if requested by the file system....

> @@ -768,6 +816,8 @@ static int do_one_pass(journal_t *journal,
>  			if (err)
>  				goto failed;
>  			continue;
> +		case JBD2_FC_BLOCK:
> +			continue;

Why should a Fast Commit block ever show up in the primary part of the
journal?   It should never happen, right?

In which case, we should probably at least issue a warning, and not
just skip the block.

					- Ted

* Re: [PATCH v3 06/13] ext4: add fields that are needed to track changed files
  2019-10-01  7:40 ` [PATCH v3 06/13] ext4: add fields that are needed to track changed files Harshad Shirwadkar
@ 2019-10-16 18:26   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 18:26 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:55AM -0700, Harshad Shirwadkar wrote:
> +/*
> + * Ext4 fast commit inode specific information
> + */
> +struct ext4_fast_commit_inode_info {

I think it would be better to move the contents of this structure
directly into ext4_inode_info, instead of adding this structure to
ext4_inode_info; the structure is never used in a free-standing
context.

> +	/*
> +	 * Flag indicating whether this inode is eligible for fast commits or
> +	 * not.
> +	 */
> +	bool fc_eligible;
> +
> +	/*
> +	 * Flag indicating whether this inode is newly created during this
> +	 * tid:subtid.
> +	 */
> +	bool fc_new;

These two bools could be replaced using EXT4_STATE_* flags.  Grep for
EXT4_STATE_NEWENTRY to see an example of how an EXT4_STATE_ flag is
defined and used.
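
For illustration only, here is a userspace sketch of the idea: two bits
in a single state word instead of two bools.  The bit names
FC_STATE_ELIGIBLE and FC_STATE_NEW and the helper names are made up, and
the helpers are plain, non-atomic bit operations; the real ext4 state
helpers (such as ext4_set_inode_state(), visible in the ialloc.c hunk
quoted below) should be used in the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit numbers standing in for new EXT4_STATE_* values. */
enum {
	FC_STATE_ELIGIBLE = 0,  /* inode is eligible for fast commit */
	FC_STATE_NEW = 1,       /* inode was created in this tid:subtid */
};

/* Hypothetical, simplified stand-in for ext4_inode_info. */
struct inode_model {
	unsigned long i_state_flags;
};

/* Non-atomic model of setting a state bit. */
static void model_set_state(struct inode_model *ei, int bit)
{
	ei->i_state_flags |= 1UL << bit;
}

/* Non-atomic model of clearing a state bit. */
static void model_clear_state(struct inode_model *ei, int bit)
{
	ei->i_state_flags &= ~(1UL << bit);
}

/* Returns true if the state bit is currently set. */
static bool model_test_state(const struct inode_model *ei, int bit)
{
	return (ei->i_state_flags & (1UL << bit)) != 0;
}
```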


> +	rwlock_t fc_lock;

What is this used for?  If it's only just to protect the i_fc_list
list_head, maybe name it i_fc_list_lock?

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 764ff4c56233..ff30f3015551 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1131,6 +1131,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  
>  	ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
>  	ext4_set_inode_state(inode, EXT4_STATE_NEW);
> +	ext4_init_inode_fc_info(inode);
>  
>  	ei->i_extra_isize = sbi->s_want_extra_isize;
>  	ei->i_inline_off = 0;

I don't think this is necessary; the inode was returned by ext4_iget,
so the ext4_alloc_inode() will have already called that function.


> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 420fe3deed39..f230a888eddd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4996,6 +4996,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>  	for (block = 0; block < EXT4_N_BLOCKS; block++)
>  		ei->i_data[block] = raw_inode->i_block[block];
>  	INIT_LIST_HEAD(&ei->i_orphan);
> +	ext4_init_inode_fc_info(&ei->vfs_inode);
>  

The inode here was returned by iget_locked(), which means
ext4_alloc_inode() will have been called.

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 7725eb2105f4..c90337fc98c1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1139,6 +1140,7 @@ static void init_once(void *foo)
>  	init_rwsem(&ei->i_data_sem);
>  	init_rwsem(&ei->i_mmap_sem);
>  	inode_init_once(&ei->vfs_inode);
> +	ext4_init_inode_fc_info(&ei->vfs_inode);
>  }

Maybe pull the rwlock_init() out of ext4_init_inode_fc_info() and
stuff it here?

Basically, it looks like certain fields are getting redundantly
initialized a lot.  The init_once function will initialize those fields
that will be reset when the structure is released.  If we are sure
that it will be reset (e.g., the spinlock will be reset), then we can
initialize it once in init_once() and then not re-initialize it in
other places, such as ext4_alloc_inode().

There are some people who think it's not worth using init_once to
avoid the reinitialization, since this can cause bugs if it turns out
a field wasn't properly reset at the time when the object is released.
So the other approach is to drop the ext4_init_inode_fc_info() and then
just reinitialize the spinlock every time.  (OTOH, if someone else is
still holding on to the spinlock when you release it, then reinitializing
the spinlock can *also* lead to a very hard-to-debug crash.)

	     	    	      	   		 - Ted

* Re: [PATCH v3 07/13] ext4: track changed files for fast commit
  2019-10-01  7:40 ` [PATCH v3 07/13] ext4: track changed files for fast commit Harshad Shirwadkar
@ 2019-10-16 20:26   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 20:26 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:56AM -0700, Harshad Shirwadkar wrote:
> +void ext4_fc_enqueue_inode(handle_t *handle, struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);

BTW, we don't actually have to call get_running_txn_tid() here.  We
have a handle, which means we know there is a running transaction, so
we can also just do:

	tid_t running_txn_tid = handle->h_transaction->t_id;

> +
> +	if (!ext4_should_fast_commit(inode->i_sb))
> +		return;
> +
> +	spin_lock(&sbi->s_fc_lock);

This is going to be a major lock contention bottleneck.  So we should
move the write_lock of &ei->i_fc.fc_lock and comparison of
ei->i_fc.fc_tid against running_txn_tid before we try to take the file
system-level s_fc_lock.
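
A single-threaded userspace model of that ordering is sketched below.
Everything in it is hypothetical: the struct and function names are
made up, no real locks are taken, and the global lock is represented
only by a counter that records how often it would have been acquired.
It also assumes that an inode already enqueued for the running tid
needs no further global work, which is my reading of the patch rather
than something I have verified.

```c
#include <assert.h>

typedef unsigned int tid_t;

/* Hypothetical stand-in for the fast commit parts of ext4_sb_info. */
struct sb_model {
	int s_fc_lock_acquisitions;  /* times the global lock would be taken */
	int s_fc_queue_len;          /* inodes on the fast commit list */
};

/* Hypothetical stand-in for the fast commit parts of ext4_inode_info. */
struct inode_model {
	tid_t fc_tid;  /* tid in which this inode was last enqueued */
	int queued;    /* non-zero once the inode is on the list */
};

static void fc_enqueue_inode_model(struct sb_model *sb,
				   struct inode_model *ei,
				   tid_t running_tid)
{
	/*
	 * Per-inode check first (in the kernel this would be done under
	 * the per-inode lock).  Repeated updates to the same inode in
	 * the same transaction return here without touching the global
	 * lock at all.
	 */
	if (ei->queued && ei->fc_tid == running_tid)
		return;
	ei->fc_tid = running_tid;

	/* Slow path: only now take the file system-level lock. */
	sb->s_fc_lock_acquisitions++;
	if (!ei->queued) {
		ei->queued = 1;
		sb->s_fc_queue_len++;
	}
}
```

With this ordering, a workload that dirties the same inode many times
per transaction takes the global lock once per inode per transaction
rather than once per update.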

> +	if (!sbi->s_fc_eligible) {
> +		spin_unlock(&sbi->s_fc_lock);
> +		return;
> +	}

I'm really not fond of the file system level s_fc_eligible; again, I
really think we should have a transaction-level "this transaction is
not eligible for fast commit" flag.  We don't have to be super careful
about locking for this flag anyway, since it only transitions from set
to unset, and here in ext4_fc_enqueue_inode(), it's only an
optimization to avoid doing extra unnecessary work.

> +static inline void
> +ext4_fc_mark_ineligible(struct inode *inode)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	write_lock(&ei->i_fc.fc_lock);
> +	if (sbi->s_journal)
> +		ei->i_fc.fc_tid = sbi->s_journal->j_commit_sequence + 1;

Use get_running_txn_tid() instead?

> +	ei->i_fc.fc_eligible = false;
> +	write_unlock(&ei->i_fc.fc_lock);
> +	spin_lock(&sbi->s_fc_lock);
> +	sbi->s_fc_eligible = false;
> +	spin_unlock(&sbi->s_fc_lock);
> +}
> +

> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f230a888eddd..6d2efbd9aba9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -279,6 +280,8 @@ void ext4_evict_inode(struct inode *inode)
>  	if (ext4_inode_is_fast_symlink(inode))
>  		memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
>  	inode->i_size = 0;
> +	ext4_fc_del(inode);
> +	ext4_fc_mark_ineligible(inode);

Why is ext4_fc_mark_ineligible() needed here?

> @@ -326,6 +330,8 @@ void ext4_evict_inode(struct inode *inode)
>  	 * having errors), but we can't free the inode if the mark_dirty
>  	 * fails.
>  	 */
> +	ext4_fc_del(inode);
> +	ext4_fc_mark_ineligible(inode);

Same question here....

> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 442f7ef873fc..a8e23acb5c03 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -987,6 +987,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		err = mnt_want_write_file(filp);
>  		if (err)
>  			return err;
> +		ext4_fc_mark_sb_ineligible(sb);
>  		err = swap_inode_boot_loader(sb, inode);
>  		mnt_drop_write_file(filp);
>  		return err;

I don't think we need to mark the whole file system (transaction) as
ineligible.  We just have to mark the two inodes being swapped as
ineligible, no?

> @@ -997,6 +998,8 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		int err = 0, err2 = 0;
>  		ext4_group_t o_group = EXT4_SB(sb)->s_groups_count;
>  
> +		ext4_fc_mark_sb_ineligible(sb);
> +
>  		if (copy_from_user(&n_blocks_count, (__u64 __user *)arg,
>  				   sizeof(__u64))) {
>  			return -EFAULT;

This is the resize ioctl, and this is the one place where we need to
mark the whole transaction as fc ineligible, since some other
subsequent handle might try to allocate blocks or inodes that were
created as the result of EXT4_IOC_RESIZE_FS.

But we shouldn't actually do it here; we should do it whenever we
start a handle that tries to resize the file system, since it is
*that* transaction that we need to make sure is made ineligible.
Otherwise there can be races where we set the flag in sbi, but before
we have a chance to start the handle which does (part of) the resize
operation, it gets cleared because another transaction committed
first.

We similarly need to mark the transaction as ineligible for any
handles created as the result of EXT4_IOC_GROUP_ADD and
EXT4_IOC_GROUP_EXTEND.  (Which are the old/legacy resize ioctls.)

> diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
> index b1e4d359f73b..b995690d73ce 100644
> --- a/fs/ext4/migrate.c
> +++ b/fs/ext4/migrate.c
> @@ -513,6 +513,7 @@ int ext4_ext_migrate(struct inode *inode)
>  		 * work to orphan_list_cleanup()
>  		 */
>  		ext4_orphan_del(NULL, tmp_inode);
> +		ext4_fc_del(inode);

This should be tmp_inode, not inode; and I don't think it's needed,
since the tmp inode will never have been fast commit enqueued.

> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 491f9ee4040e..19bc4046658c 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1406,6 +1406,7 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
>  	inode_unlock(ea_inode);
>  
>  	ext4_mark_inode_dirty(handle, ea_inode);
> +	ext4_fc_enqueue_inode(handle, ea_inode);

If we modify an external xattr block, or if we need to create (or
modify the ref count) on an EA inode, we need to disable fast commit
on the inode whose xattrs we are manipulating.  Could you add that
logic, please?

We could add support for writing out the external xattr block to the
fast commit log if it has been modified, but that's a change in the
fast commit journal format.

					- Ted

* Re: [PATCH v3 08/13] ext4: fast-commit commit range tracking
  2019-10-01  7:40 ` [PATCH v3 08/13] ext4: fast-commit commit range tracking Harshad Shirwadkar
@ 2019-10-16 21:36   ` Theodore Y. Ts'o
  2019-10-30  5:12     ` harshad shirwadkar
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 21:36 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:57AM -0700, Harshad Shirwadkar wrote:
> With this patch, we track logical range of file offsets that need to
> be committed using fast commit. This allows us to find file extents
> that need to be committed during the commit time.

We don't actually need to track when data is modified in the page
cache, which is what this commit is actually doing.  We only need to
track newly allocated blocks, at granularity of the logical block
number.

That's because we only need to force out newly allocated blocks to
make sure we don't reveal stale data when we are in data=ordered mode.
And it also follows that we don't need to track logical block ranges
and submit inode data in data=writeback or data=journalled mode.

In the case where the user has actually called fsync() on the
inode, we do a data integrity writeback in ext4_sync_file, and that's
independent of the fast commit code.

But if the file is being modified using buffered writes, or if an
already allocated block is changed, and the file has *not* been
changed, we don't need to write out those blocks on a fast commit.
For example, in the case where the fast commit is being
initiated via ext4_nfs_commit_metadata() -> ext4_write_inode(), we
only care about submitting data for the newly allocated blocks.  And
that's what we want to track here.

Hence, all of the callers of ext4_fc_update_commit_range() here are in
the wrong place.  (Also, they are calling ext4_fc_update_commit_range
with byte offsets, when the function is expecting logical block
numbers, but that doesn't really matter, since the existing call sites
need to be all removed and replaced with new ones in ext4_map_blocks().)
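
To show the unit mismatch, here is a small userspace sketch of tracking
a range in logical block numbers and of converting a byte range into
one.  The type and function names are made up for illustration, and
blkbits is assumed to be log2 of the block size (12 for 4k blocks).
Call sites in ext4_map_blocks() would already have logical block
numbers and would not need the byte conversion at all.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;

/* Hypothetical per-inode range of logical blocks to fast commit. */
struct lblk_range {
	lblk_t start;
	lblk_t end;    /* inclusive */
	int valid;     /* 0 until the first update */
};

/* Logical block containing byte offset pos. */
static lblk_t byte_to_lblk(uint64_t pos, unsigned int blkbits)
{
	return (lblk_t)(pos >> blkbits);
}

/* Grow the tracked range so that it covers [start, end]. */
static void range_update(struct lblk_range *r, lblk_t start, lblk_t end)
{
	if (!r->valid) {
		r->start = start;
		r->end = end;
		r->valid = 1;
		return;
	}
	if (start < r->start)
		r->start = start;
	if (end > r->end)
		r->end = end;
}

/* Convert a byte range to logical blocks before updating the range. */
static void range_update_bytes(struct lblk_range *r, uint64_t pos,
			       uint64_t len, unsigned int blkbits)
{
	if (len == 0)
		return;
	range_update(r, byte_to_lblk(pos, blkbits),
		     byte_to_lblk(pos + len - 1, blkbits));
}
```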

       	       	   	    	     - Ted

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
  2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
@ 2019-10-16 22:45   ` Theodore Y. Ts'o
       [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-16 22:45 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:58AM -0700, Harshad Shirwadkar wrote:
> This patch implements the actual commit path for fast commit. Based on
> inodes tracked and their respective changes remembered, this
> patch adds code to create a fast commit block that stores extents
> added as well as dentrys created for the inode. We use new JBD2
> interfaces added in previous patches in this series. The fast commit
> blocks that are created have extents that _should_ be present in the
> file. It doesn't yet support removing of extents, making operations
> such as truncate, delete fast commit incompatible.

This affects some of the earlier patches, but I didn't realize this
until now.  Right now, when we initiate a fast commit, we write out
all fast-commit eligible inodes (and flush out any associated data
blocks needed to maintain data=ordered guarantees).

We don't actually have to do this.  Strictly speaking, we only have to
write out the specific inode being fsync'ed, or the specific inode for
which ext4_nfs_commit_metadata() has been called.  Consider an fsync()
workload where we have hundreds of modified inodes, all of which are
fc-eligible (say, because a kernel build is happening in the
background), and a single file which is being fsync'ed (say, because
the programmer has just saved a source file in emacs).  In that case
we only need to include that single inode in the fast commit.
Including *all* of the inodes in the i_fc_list in the fast commit is
wasted effort, especially since the inodes in question will be
committed within the next 5 seconds.

Now, in the case of ext4_nfs_commit_metadata(), we know that NFS is
*very* aggressive at calling commit_metadata, and so writing out all
of the FC-eligible inodes is probably a good thing to do.  So we might
want to do different things depending on whether the FC is being
initiated via fsync() or fdatasync() versus commit_metadata().

The other reason why it's better to write out all of the inodes only
for ext4_nfs_commit_metadata() is fairness.  If we only write out the
inode which is being fsync'ed, we don't have to worry about fairness
concerns, since the I/O will be charged to the process/cgroup that
requested the fsync.  If we write out *all* the fc-eligible inodes in
the fast commit, then their I/O will all get charged to the process
doing the fsync(2).  Whereas for an NFS server, we don't care about
cgroups, since the I/O can all be charged to the NFS server.
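
A rough userspace model of the policy described above, purely as a
sketch: the enum, struct and function names here are hypothetical and
do not exist in the patch series.  For an fsync-triggered fast commit
only the target inode is selected; for commit_metadata every
fc-eligible inode is selected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: who asked for the fast commit. */
enum fc_trigger { FC_TRIGGER_FSYNC, FC_TRIGGER_COMMIT_METADATA };

/* Hypothetical model of an entry on the list of tracked inodes. */
struct fc_inode {
	unsigned long ino;
	int fc_eligible;
};

/*
 * Fill out[] with the inode numbers to include in the fast commit and
 * return how many were selected (at most out_cap).
 */
size_t fc_select_inodes(enum fc_trigger trigger, unsigned long target_ino,
			const struct fc_inode *list, size_t n,
			unsigned long *out, size_t out_cap)
{
	size_t count = 0;

	for (size_t i = 0; i < n && count < out_cap; i++) {
		if (!list[i].fc_eligible)
			continue;
		/* fsync()/fdatasync(): only the inode being synced. */
		if (trigger == FC_TRIGGER_FSYNC && list[i].ino != target_ino)
			continue;
		out[count++] = list[i].ino;
	}
	return count;
}
```

The point of the split is that the fsync case keeps the I/O small and
attributable to the caller, while the NFS case batches everything.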


> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 0bb8de2139a5..fd7740372438 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -4,6 +4,7 @@
> +static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +{
> +	struct buffer_head *orig_bh = bh->b_private;
> +
> +	BUFFER_TRACE(bh, "");
> +	if (uptodate) {
> +		ext4_debug("%s: Block %lld up-to-date",
> +			   __func__, bh->b_blocknr);
> +		set_buffer_uptodate(bh);
> +	} else {
> +		ext4_debug("%s: Block %lld not up-to-date",
> +			   __func__, bh->b_blocknr);
> +		clear_buffer_uptodate(bh);
> +	}
> +	if (orig_bh) {
> +		clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
> +		/* Protect BH_Shadow bit in b_state */
> +		smp_mb__after_atomic();
> +		wake_up_bit(&orig_bh->b_state, BH_Shadow);
> +	}

We don't need to deal with BH_Shadow handling here.  This is needed
when we are writing out buffer heads corresponding to ext4 metadata
(e.g., an inode table block, a block group descriptor block).  We're
only writing out bh's corresponding to the journal, so the BH_Shadow
bit should never be set on such bh's.

> +static inline u8 *fc_add_tag(u8 *dst, u16 tag, u16 len, u8 *val)

Can you add some documentation for this function?  In particular, what
does it return?  I also tend to prefer to pass in the pointer to the
buffer (val) first, followed then by the length (len), but that's more
of a personal preference.
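
Something along these lines is what I have in mind for the
documentation and argument order.  This is only a userspace sketch:
the name fc_add_tag_sketch() is made up, and the layout (16-bit
little-endian tag followed by 16-bit little-endian length, then the
value) is an assumption based on how the replay code reads fc_tag and
fc_len.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * fc_add_tag_sketch - append one tag/length/value record at dst.
 * @dst: where to write the record
 * @tag: tag identifier, stored as 16-bit little-endian
 * @val: pointer to the value bytes
 * @len: number of value bytes, stored as 16-bit little-endian
 *
 * Returns a pointer to the first byte after the record, i.e. where the
 * next record would start.  The caller must ensure that dst has room
 * for 4 + len bytes.
 */
uint8_t *fc_add_tag_sketch(uint8_t *dst, uint16_t tag,
			   const uint8_t *val, uint16_t len)
{
	dst[0] = (uint8_t)(tag & 0xff);
	dst[1] = (uint8_t)(tag >> 8);
	dst[2] = (uint8_t)(len & 0xff);
	dst[3] = (uint8_t)(len >> 8);
	memcpy(dst + 4, val, len);
	return dst + 4 + len;
}
```

Returning the advanced pointer lets callers chain several records
without tracking the offset separately.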

> +int ext4_fc_write_inode(journal_t *journal, struct buffer_head *bh,
> +			struct inode *inode, tid_t tid, tid_t subtid,
> +			int is_last, struct dentry *dentry)
> +{

  ...
> +
> +	memcpy(&fc_hdr->inode, ext4_raw_inode(&iloc), EXT4_INODE_SIZE(sb));

So this is a bit problematic.  In the structure definition,
fc_hdr->inode is not at the end of the structure:

struct ext4_fc_commit_hdr {
	/* Fast commit magic, should be EXT4_FC_MAGIC */
	__le32 fc_magic;
	...	
	/* ext4 inode on disk copy */
	struct ext4_inode inode;
	/* Csum(hdr+contents) */
	__le32 fc_csum;
};

... and the size of struct ext4_inode is just the fixed portion of the
inode, and is almost always smaller than EXT4_INODE_SIZE(sb) ---
except in the case of 128 byte inodes, in which case the fields
i_extra_isize and beyond are going to be beyond the 128 byte limit.

So this isn't going to work.  I'm guessing you didn't test with
extended attributes, because the checksum would have overwritten the
beginning of the in-inode xattrs?

Also, note that EXT4_INODE_SIZE(sb) can be set to the block size.
It's super-rare, but that is legal.  Which means we need to test for
that case somewhere, and either (a) disable fast commits when the
inode size == blocksize, or (b) support a fast commit log which is
larger than a single block.  (This is doable, since there is a
checksum field to protect against partial writes.)
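
The test for that case could look roughly like the following
userspace sketch.  The function name and parameters are hypothetical;
fixed_hdr stands for whatever fixed fields precede the inode copy.
With nblocks == 1 it models option (a), with a larger nblocks it
models option (b).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if a fast commit entry carrying a full on-disk inode fits in
 * nblocks blocks, 0 otherwise.  fixed_hdr is the size of the fixed
 * header fields that precede the inode copy, and csum_size is the size
 * of the trailing checksum.
 */
int fc_inode_fits(size_t blocksize, size_t nblocks, size_t fixed_hdr,
		  size_t inode_size, size_t csum_size)
{
	size_t capacity = blocksize * nblocks;

	return fixed_hdr + inode_size + csum_size <= capacity;
}
```

An inode size equal to the block size can never fit in a single-block
entry once any header is present, so that configuration either needs
multi-block entries or has to fall back to a full commit.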

> +struct ext4_fc_commit_hdr {
> +	/* Fast commit magic, should be EXT4_FC_MAGIC */
> +	__le32 fc_magic;
> +	/* Sub transaction ID */
> +	__le32 fc_subtid;
> +	/* Features used by this fast commit block */
> +	__u8 fc_features;
> +	/* Flags for this block. */
> +	__u8 fc_flags;

What fc_features and fc_flags are you thinking we would need?  I can't
think of any good reasons to have per-fc block features.  But I can
think of reasons why we might want to support a small number of blocks
in an fc entry.  So maybe repurpose fc_features as a block count with
some limit, such as, say, 4 blocks, and on the replay side we can just
kmalloc 4 * blocksize worth of space to read in that number of blocks
if necessary?

> +	/* Number of TLVs in this fast commmit block */
> +	__le16 fc_num_tlvs;
> +	/* Inode number */
> +	__le32 fc_ino;
> +	/* ext4 inode on disk copy */
> +	struct ext4_inode inode;
> +	/* Csum(hdr+contents) */
> +	__le32 fc_csum;

I'd suggest putting the checksum at the very end of the fc entry,
e.g., at offset 4092 if there is only a single 4k block in the fc
commit entry.  Also, I'd make sure that we explicitly zero all of the
bytes between the end of the TLV section and the checksum, and specify
that the checksum is calculated including the must-be-zero padding,
just to keep things simple.
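
As a userspace sketch of that layout (all names here are made up, and
the plain CRC-32C with the conventional initial value and final
inversion is an assumption, not necessarily how ext4 would seed it):
zero the padding, checksum everything up to the last four bytes, and
store the result little-endian in those last four bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c_sketch(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/*
 * Zero everything between the end of the TLVs ("used" bytes) and the
 * checksum slot, then store the checksum in the last four bytes.
 * Returns 0 on success, -1 if the TLVs leave no room for the checksum.
 */
int fc_seal_entry(uint8_t *entry, size_t entry_size, size_t used)
{
	size_t csum_off;
	uint32_t csum;

	if (entry_size < 4 || used > entry_size - 4)
		return -1;
	csum_off = entry_size - 4;
	memset(entry + used, 0, csum_off - used);
	/* The checksum covers the header, the TLVs and the zero padding. */
	csum = crc32c_sketch(entry, csum_off);
	entry[csum_off + 0] = (uint8_t)(csum & 0xff);
	entry[csum_off + 1] = (uint8_t)((csum >> 8) & 0xff);
	entry[csum_off + 2] = (uint8_t)((csum >> 16) & 0xff);
	entry[csum_off + 3] = (uint8_t)((csum >> 24) & 0xff);
	return 0;
}

/* Return 1 if the stored checksum matches the entry contents. */
int fc_verify_entry(const uint8_t *entry, size_t entry_size)
{
	size_t o;
	uint32_t stored;

	if (entry_size < 4)
		return 0;
	o = entry_size - 4;
	stored = (uint32_t)entry[o] |
		 ((uint32_t)entry[o + 1] << 8) |
		 ((uint32_t)entry[o + 2] << 16) |
		 ((uint32_t)entry[o + 3] << 24);
	return stored == crc32c_sketch(entry, o);
}
```

Because the padding is defined to be zero and is covered by the
checksum, the replay side never needs to know where the TLVs ended in
order to verify the entry.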

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-01  7:41 ` [PATCH v3 12/13] docs: Add fast commit documentation Harshad Shirwadkar
@ 2019-10-18  1:56   ` Theodore Y. Ts'o
  2019-10-18  4:51     ` Andreas Dilger
  2019-10-31  5:34     ` harshad shirwadkar
  0 siblings, 2 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-18  1:56 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:41:01AM -0700, Harshad Shirwadkar wrote:
> +
> +Multiple fast commit blocks are a part of one sub-transaction. To
> +indicate the last block in a fast commit transaction, fc_flags field
> +in the last block in every subtransaction is marked with "LAST" (0x1)
> +flag. A subtransaction is valid only if all the following conditions
> +are met:
> +
> +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> +   the previous fast commit block.
> +2) For every sub-transaction, last block is marked with LAST flag.
> +3) There are no invalid blocks in between.
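
For reference, my reading of the three rules quoted above, as a small
userspace model.  The struct and function names are hypothetical, the
"valid" field stands in for whatever magic/checksum checks the real
code does, and treating a SUBTID change before a LAST block as the end
of the valid region is my interpretation of rule 2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FC_FLAG_LAST 0x1	/* "LAST" flag value from the quoted text */

/* Hypothetical model of one fast commit block as seen by the scan. */
struct fc_block_model {
	uint32_t subtid;
	uint8_t flags;
	int valid;	/* stand-in for magic/checksum checks passing */
};

/* Count the complete, valid subtransactions at the start of b[]. */
size_t fc_count_valid_subtxns(const struct fc_block_model *b, size_t n)
{
	size_t complete = 0;
	int open = 0;		/* inside a subtransaction with no LAST yet */
	uint32_t cur = 0;

	for (size_t i = 0; i < n; i++) {
		if (!b[i].valid)
			break;		/* rule 3: no invalid blocks */
		if (i > 0 && b[i].subtid < b[i - 1].subtid)
			break;		/* rule 1: SUBTID never decreases */
		if (open && b[i].subtid != cur)
			break;		/* rule 2: previous one never closed */
		cur = b[i].subtid;
		open = 1;
		if (b[i].flags & FC_FLAG_LAST) {
			complete++;
			open = 0;
		}
	}
	return complete;
}
```

A trailing subtransaction that never reaches its LAST block is simply
not counted, which matches the all-or-nothing intent.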

I'm wondering why we need to support multiple inodes being modified in
a single transaction.  As we currently have defined what can be done,
all updates to an inode should be free standing and not dependent on a
change to another inode, right?  And today, one block only modifies
one inode.

The only reason why we might want to define a sub-transaction as being
composed of multiple inodes, which must all be updated in an
all-or-nothing fashion, is the swap boot inode ioctl, and if that's
the only one, I wonder if it's worth the extra complexity.

Am I missing anything?

					- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 10/13] ext4: fast-commit recovery path changes
  2019-10-01  7:40 ` [PATCH v3 10/13] ext4: fast-commit recovery " Harshad Shirwadkar
@ 2019-10-18  2:07   ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-18  2:07 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Tue, Oct 01, 2019 at 12:40:59AM -0700, Harshad Shirwadkar wrote:
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 0b202e00d93f..2433f12d2d88 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -360,7 +360,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>  				      struct buffer_head *bh)
>  {
>  	ext4_fsblk_t	blk;
> -	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
> +	struct ext4_group_info *grp;
> +
> +	if (EXT4_SB(sb)->s_fc_replay)
> +		return 0;

Instead of adding a bool (s_fc_replay) to sbi, why not just use
sbi->s_mount_state and define a new bit, EXT4_REPLAY_FC (alongside
EXT4_ORPHAN_FS, et al.)?
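
Roughly like this, as a userspace sketch.  The SKETCH_* names and the
0x0008 value are placeholders picked for illustration; the real bit
would need to be chosen so that it does not collide with the existing
mount state bits.

```c
#include <assert.h>

/* Placeholder bit values for illustration only. */
#define SKETCH_VALID_FS		0x0001
#define SKETCH_ERROR_FS		0x0002
#define SKETCH_ORPHAN_FS	0x0004
#define SKETCH_REPLAY_FC	0x0008	/* hypothetical new bit */

/* Mark the start of fast commit replay in the mount state word. */
void fc_replay_begin(unsigned short *mount_state)
{
	*mount_state |= SKETCH_REPLAY_FC;
}

/* Mark the end of fast commit replay. */
void fc_replay_end(unsigned short *mount_state)
{
	*mount_state &= (unsigned short)~SKETCH_REPLAY_FC;
}

/* Return nonzero while fast commit replay is in progress. */
int fc_in_replay(unsigned short mount_state)
{
	return (mount_state & SKETCH_REPLAY_FC) != 0;
}
```

The other state bits are left untouched, so checks such as the orphan
processing test keep working during replay.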

> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index fd7740372438..12d6e70bf676 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c

> +int ext4_fc_create_inode(struct super_block *sb, struct ext4_inode *raw_inode,
> +			 int ino, unsigned long parent, const char *dname,
> +			 int dlen)
> +{
> +	struct inode *dir = NULL, *inode = NULL;
> +	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
> +	struct qstr qstr_dname = QSTR_INIT(dname, dlen);
> +	struct ext4_dir_entry_2 *res_dir = NULL;
> +	struct buffer_head *dirent_bh;
> +	int ret = 0, inlined;
> +
	...
> +		if (le32_to_cpu(res_dir->inode) != inode->i_ino) {
> +			jbd_debug(1, "Entry exists and mismatched inode nos.");
> +			brelse(dirent_bh);
> +			ret = -EEXIST;
> +			goto out;


We have a number of statements where ret gets set to an error, but
then when we look at what happens after the out label...

> +out:
	...
> +
> +	return 0;
> +}

It always returns 0; I think we should be returning ret?
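
In other words, the usual pattern, shown here as a stripped-down
userspace sketch.  create_entry_sketch() and its arguments are
hypothetical and only mimic the shape of the error path in the patch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Pretend to create a directory entry for wanted_ino.  existing_ino is
 * the inode number already stored under that name, or 0 if there is no
 * such entry.
 */
int create_entry_sketch(int existing_ino, int wanted_ino)
{
	int ret = 0;

	if (existing_ino != 0 && existing_ino != wanted_ino) {
		ret = -EEXIST;
		goto out;
	}
	/* ... the real work would go here ... */
out:
	/* Cleanup shared by the success and failure paths goes here. */
	return ret;	/* not "return 0": propagate the recorded error */
}
```

With "return 0" at the end, the -EEXIST case above would be silently
reported as success to the replay code.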


> +static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
> +				     enum passtype pass, int off)
> +{
> +	struct super_block *sb = journal->j_private;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_commit_hdr *fc_hdr;
> +	struct ext4_fc_tl *tl;
> +	struct ext4_iloc iloc;
> +	struct ext4_extent *ex;
> +	struct inode *inode;
> +	char *dname = NULL;
> +	int dname_len = 0;
> +	int parent_ino = -1;
> +	int i, j, ret;
> +
> +	if (pass == PASS_SCAN)
> +		return ext4_journal_fc_replay_scan(sb, bh, off);
> +
> +	if (sbi->s_fc_replay_state.fc_replay_error) {
> +		jbd_debug(1, "FC replay error set = %d\n",
> +			  sbi->s_fc_replay_state.fc_replay_error);
> +		return sbi->s_fc_replay_state.fc_replay_error;
> +	}
> +
> +	sbi->s_fc_replay = true;
> +	fc_hdr = (struct ext4_fc_commit_hdr *)
> +		  ((__u8 *)bh->b_data + sizeof(journal_header_t));
> +
> +	jbd_debug(3, "%s: Got FC block for inode %d at [%d,%d]", __func__,
> +		  le32_to_cpu(fc_hdr->fc_ino),
> +		  be32_to_cpu(((journal_header_t *)bh->b_data)->h_sequence),
> +		  le32_to_cpu(fc_hdr->fc_subtid));
> +
> +	tl = (struct ext4_fc_tl *)(fc_hdr + 1);
> +	if (le16_to_cpu(fc_hdr->fc_num_tlvs) >= 2) {
> +		for (i = 0; i < 2; i++) {
> +			switch (le16_to_cpu(tl->fc_tag)) {
> +			case EXT4_FC_TAG_DNAME:
> +				dname = fc_tag_val(tl);
> +				dname_len = fc_tag_len(tl);
> +				break;
> +			case EXT4_FC_TAG_PARENT_INO:
> +				parent_ino = le32_to_cpu(
> +				    *(__le32 *)fc_tag_val(tl));
> +				break;
> +			}
> +			tl = (struct ext4_fc_tl *)(fc_tag_val(tl) +
> +						   fc_tag_len(tl));
> +		}
> +	}
> +
> +	if (parent_ino && dname) {
> +		ret = ext4_fc_create_inode(sb, &fc_hdr->inode,
> +				     le32_to_cpu(fc_hdr->fc_ino), parent_ino,
> +				     dname, dname_len);
> +		if (ret) {
> +			jbd_debug(1, "Failed to create ext4 inode.");
> +			return ret;
> +		}
> +	}
> +
> +	inode = ext4_iget(sb, le32_to_cpu(fc_hdr->fc_ino), EXT4_IGET_NORMAL);
> +	if (IS_ERR(inode))
> +		return 0;
> +
> +	ret = ext4_get_inode_loc(inode, &iloc);
> +	if (ret)
> +		return ret;
> +
> +	inode_lock(inode);
> +	tl = (struct ext4_fc_tl *)(fc_hdr + 1);
> +	for (i = 0; i < le16_to_cpu(fc_hdr->fc_num_tlvs); i++) {
> +		switch (le16_to_cpu(tl->fc_tag)) {
> +		case EXT4_FC_TAG_EXT:
> +			ex = (struct ext4_extent *)(tl + 1);
> +			/*
> +			 * We add block by block because part of extent may
> +			 * already have been added by a previous fast commit
> +			 * replay.
> +			 */
> +			for (j = 0; j < ext4_ext_get_actual_len(ex); j++)
> +				ext4_fc_add_block(inode,
> +						  le32_to_cpu(ex->ee_block) + j,
> +						  ext4_ext_pblock(ex) + j,
> +						  ext4_ext_is_unwritten(ex));
> +			break;
> +		case EXT4_FC_TAG_PARENT_INO:
> +		case EXT4_FC_TAG_DNAME:
> +			break;
> +		default:
> +			jbd_debug(1, "Unknown tag found.\n");
> +		}
> +		tl = (struct ext4_fc_tl *)((__u8 *)tl +
> +					   le16_to_cpu(tl->fc_len) +
> +					   sizeof(*tl));
> +	}
> +	ext4_reserve_inode_write(NULL, inode, &iloc);
> +	inode_unlock(inode);
> +
> +	/*
> +	 * Unless inode contains inline data, copy everything except
> +	 * i_blocks. i_blocks would have been set alright by ext4_fc_add_block
> +	 * call above.
> +	 */
> +	if (ext4_has_inline_data(inode)) {
> +		memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
> +		       sizeof(struct ext4_inode));
> +	} else {
> +		memcpy(ext4_raw_inode(&iloc), &fc_hdr->inode,
> +		       offsetof(struct ext4_inode, i_block));
> +		memcpy(&ext4_raw_inode(&iloc)->i_generation,
> +		       &fc_hdr->inode.i_generation,
> +		       sizeof(struct ext4_inode) -
> +		       offsetof(struct ext4_inode, i_generation));
> +	}
> +	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
> +	ext4_reset_inode_seed(inode);
> +
> +	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
> +	ret = ext4_handle_dirty_metadata(NULL, inode, iloc.bh);
> +	brelse(iloc.bh);
> +	iput(inode);
> +	if (!ret)
> +		ret = blkdev_issue_flush(sb->s_bdev, GFP_KERNEL, NULL);
> +
> +	sbi->s_fc_replay = false;
> +
> +	return ret;
> +}
> +
>  void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
>  {
>  	if (ext4_should_fast_commit(sb)) {
>  		journal->j_fc_commit_callback = ext4_journal_fc_commit_cb;
>  		journal->j_fc_cleanup_callback = ext4_journal_fc_cleanup_cb;
>  	}
> +
> +	/*
> +	 * We set replay callback even if fast commit disabled because we may
> +	 * could still have fast commit blocks that need to be replayed even if
> +	 * fast commit has now been turned off.
> +	 */
> +	journal->j_fc_replay_callback = ext4_journal_fc_replay_cb;
>  }
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index dea4c2632272..d70c09cbbc3f 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2903,9 +2903,11 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>  	ext_debug("truncate since %u to %u\n", start, end);
>  
>  	/* probably first extent we're gonna free will be last in block */
> -	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
> -	if (IS_ERR(handle))
> -		return PTR_ERR(handle);
> +	if (!sbi->s_fc_replay) {
> +		handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, depth + 1);
> +		if (IS_ERR(handle))
> +			return PTR_ERR(handle);
> +	}


I'm curious; what fast commits will result in our needing to call
ext4_ext_remove_space?  I thought we weren't supporting truncate,
punch hole, etc.

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 47d04a33a3ca..d32dea0757fe 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -287,15 +292,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
	...
> +	if (!sbi->s_fc_replay) {
> +		grp = ext4_get_group_info(sb, block_group);
> +		if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
> +			fatal = -EFSCORRUPTED;
> +			goto error_return;

And ditto for ext4_free_inode?

> @@ -758,7 +765,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,

And I'm surprised we'd want to use ext4_new_inode for fast commit,
since for fast commit, we already know what inode number should be
used for a newly created file.  ext4_new_inode() is going to be
searching for what inode to allocate, which we wouldn't need to do for
fast commit, no?

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-18  1:56   ` Theodore Y. Ts'o
@ 2019-10-18  4:51     ` Andreas Dilger
  2019-10-18 13:28       ` Theodore Y. Ts'o
  2019-10-31  5:34     ` harshad shirwadkar
  1 sibling, 1 reply; 36+ messages in thread
From: Andreas Dilger @ 2019-10-18  4:51 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Harshad Shirwadkar, linux-ext4

What about rename or hard link?

Cheers, Andreas

> On Oct 18, 2019, at 10:56, Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
>> On Tue, Oct 01, 2019 at 12:41:01AM -0700, Harshad Shirwadkar wrote:
>> +
>> +Multiple fast commit blocks are a part of one sub-transaction. To
>> +indicate the last block in a fast commit transaction, fc_flags field
>> +in the last block in every subtransaction is marked with "LAST" (0x1)
>> +flag. A subtransaction is valid only if all the following conditions
>> +are met:
>> +
>> +1) SUBTID of all blocks is either equal to or greater than SUBTID of
>> +   the previous fast commit block.
>> +2) For every sub-transaction, last block is marked with LAST flag.
>> +3) There are no invalid blocks in between.
> 
> I'm wondering why we need to support multiple inodes being modified in
> a single transaction.  As we currently have defined what can be done,
> all updates to an inode should be free standing and not dependent on a
> change to another inode, right?  And today, one block only modifies
> one inode.
> 
> The only reason why we might want to define a sub-transaction as being
> composed of multiple inodes, which must all be updated in an
> all-or-nothing fashion, is the swap boot inode ioctl, and if that's
> the only one, I wonder if it's worth the extra complexity.
> 
> Am I missing anything?
> 
>                    - Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-18  4:51     ` Andreas Dilger
@ 2019-10-18 13:28       ` Theodore Y. Ts'o
  2019-10-31 18:53         ` Andreas Dilger
  0 siblings, 1 reply; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-18 13:28 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Harshad Shirwadkar, linux-ext4

On Fri, Oct 18, 2019 at 01:51:56PM +0900, Andreas Dilger wrote:
> What about rename or hard link?

Neither is currently handled by the fast commit patches, but each
operation can fit inside a single block, so it could be handled as an
update to a single inode.  In the case of rename, we will need to add
some tags to indicate the destination directory and directory entry
name, and whether or not there is a destination inode which needs to
have its refcount dropped and possibly be deleted.

Harshad, we probably should handle them, since in order to support
NFS, the nfs server will send the rename or hard link request,
followed by a commit metadata request, and that commit metadata
request needs to persist the rename or link.  So for the purposes of
accelerating NFS, we should handle these commands.

If we don't handle these commands, we will need to declare the inode
as fast commit ineligible, so that we force a full journal commit when
the commit metadata request is received.

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 01/13] ext4: add handling for extended mount options
  2019-10-16  2:14   ` Theodore Y. Ts'o
@ 2019-10-21 20:41     ` harshad shirwadkar
  0 siblings, 0 replies; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-21 20:41 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List, Andreas Dilger

On Tue, Oct 15, 2019 at 7:14 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Oct 01, 2019 at 12:40:50AM -0700, Harshad Shirwadkar wrote:
> > @@ -1858,8 +1863,9 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> >                       set_opt2(sb, EXPLICIT_DELALLOC);
> >               } else if (m->mount_opt & EXT4_MOUNT_JOURNAL_CHECKSUM) {
> >                       set_opt2(sb, EXPLICIT_JOURNAL_CHECKSUM);
> > -             } else
> > +             } else if (m->mount_opt) {
> >                       return -1;
> > +             }
> >       }
> >       if (m->flags & MOPT_CLEAR_ERR)
> >               clear_opt(sb, ERRORS_MASK);
>
> Why is this change needed?  This is in the handling of options that
> have MOPT_EXPLICIT, and it doesn't seem relevant to this commit?
You are right, this change is an irrelevant change. I'll remove it in
next version. Thanks!
>
>                                                  - Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 05/13] jbd2: fast-commit recovery path changes
  2019-10-16 17:30   ` Theodore Y. Ts'o
@ 2019-10-22  0:51     ` harshad shirwadkar
  0 siblings, 0 replies; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-22  0:51 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

On Wed, Oct 16, 2019 at 10:30 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Oct 01, 2019 at 12:40:54AM -0700, Harshad Shirwadkar wrote:
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 14d549445418..e0684212384d 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> >
> >       jbd2_write_superblock(journal, write_op);
> >
> > +     if (had_fast_commit)
> > +             jbd2_set_feature_fast_commit(journal);
> > +
>
> Why the logic with had_fast_commit and (re-)setting the fast commit
> feature flag?
>
> This ties back to how we handle the logic around setting the fast
> commit flag if requested by the file system....

The fast commit feature flag serves two purposes: 1) If the flag is
turned on in the on-disk superblock, it means that the journal contains
fast commit blocks that should be replayed. 2) If the flag is turned on
in the in-memory representation of the superblock, it serves as an
indicator for the rest of the JBD2 code that the fast commit feature is
enabled. Based on that flag, for example, the journal thread decides
to try fast commits. In this particular case, since the journal is
empty, we don't want to commit the fast commit feature flag on-disk, but
we want to retain that flag in the in-memory structure.
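
A simplified userspace model of that sequence, only as a sketch: the
struct, the function names and the 0x20 bit value are hypothetical,
and write_sb_model() stands in for writing the journal superblock.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FEATURE_FAST_COMMIT 0x20u	/* hypothetical bit value */

struct journal_model {
	uint32_t incompat;	/* in-memory feature bits */
	uint32_t disk_incompat;	/* what was last written to disk */
};

/* Stand-in for writing the journal superblock to disk. */
void write_sb_model(struct journal_model *j)
{
	j->disk_incompat = j->incompat;
}

/*
 * Mark the journal empty: the on-disk superblock must not advertise
 * fast commit blocks, but the in-memory flag is restored afterwards so
 * that the rest of the code keeps attempting fast commits.
 */
void mark_journal_empty_model(struct journal_model *j)
{
	int had_fast_commit =
		(j->incompat & SKETCH_FEATURE_FAST_COMMIT) != 0;

	j->incompat &= ~SKETCH_FEATURE_FAST_COMMIT;
	write_sb_model(j);
	if (had_fast_commit)
		j->incompat |= SKETCH_FEATURE_FAST_COMMIT;
}
```

After the call, the on-disk copy has the flag cleared while the
in-memory copy still has it set.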

>
> > @@ -768,6 +816,8 @@ static int do_one_pass(journal_t *journal,
> >                       if (err)
> >                               goto failed;
> >                       continue;
> > +             case JBD2_FC_BLOCK:
> > +                     continue;
>
> Why should a Fast Commit block ever show up in the primary part of the
> journal?   It should never happen, right?
That's right, I'll fix this in next version.
>
> In which case, we should probably at least issue a warning, and not
> just skip the block.
>
>                                         - Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 09/13] ext4: fast-commit commit path changes
       [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
@ 2019-10-23 12:44       ` Theodore Y. Ts'o
  0 siblings, 0 replies; 36+ messages in thread
From: Theodore Y. Ts'o @ 2019-10-23 12:44 UTC (permalink / raw)
  To: xiaohui li; +Cc: Harshad Shirwadkar, linux-ext4

On Wed, Oct 23, 2019 at 04:58:47PM +0800, xiaohui li wrote:
> why not let fsync handle enjoy one transaction exclusively ?
> that is to say, in this transaction, there is only one handle which is
> generated in one file's fsync path .

There is only one handle which is generated in one file's fsync path.
That isn't the problem.  (If it were that simple, we would have done
it a long time ago.)

The problem is that there may have been other handles that have been
started before the fsync transaction, and these handles will have
already made changes to the file system.  Worse, some of those handles
may have made changes in the same metadata blocks which the fsync
operation needs to modify.

For example, suppose we are three seconds into the current
transaction, with potentially hundreds of handles that have already
been started and finished --- but not yet committed, because the
current transaction hasn't closed.  All of those handles have already
been attached to the current transaction, and they can't be ignored.

The fast commit patch set deals with this by using part of the journal
for a "fast commit journal" where we essentially are doing a very
simplified logical journal.  It doesn't handle all cases, and there
will be situations where we will need to fall back to the physical
journalling techniques used in ext4 today.  For example, if the file
has been truncated, and then a single 4k block is written, and then
the file gets fsync'ed, we won't be able to use the fast commit
logical journal.  Fortunately, the common cases, which comprise well
over 99% of most workloads, are much simpler to handle, and these can
be handled via the fast commit patch.
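
Conceptually the fallback decision is just a check on what has been
done to the inode since the last commit.  The sketch below is a
userspace model with made-up names; the set of supported operations
reflects what this version of the patch set handles (creates and added
extents), not a final design.

```c
#include <assert.h>

/* Hypothetical per-inode record of operations since the last commit. */
enum {
	OP_CREATE	= 1 << 0,
	OP_ADD_EXTENT	= 1 << 1,
	OP_TRUNCATE	= 1 << 2,
	OP_UNLINK	= 1 << 3,
	OP_RENAME	= 1 << 4,
	OP_LINK		= 1 << 5,
};

/* Operations the simplified logical journal can describe. */
#define FC_SUPPORTED_OPS (OP_CREATE | OP_ADD_EXTENT)

/*
 * Return 1 if every recorded operation can be expressed as a fast
 * commit, 0 if we must fall back to a full physical journal commit.
 */
int fc_can_fast_commit(unsigned int ops)
{
	return (ops & ~(unsigned int)FC_SUPPORTED_OPS) == 0;
}
```

The truncate-then-write example above maps to
OP_TRUNCATE | OP_ADD_EXTENT, which fails the check and so takes the
full commit path.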

The fast commit approach is a simplified version of the idea proposed
by Daejun Park and Dongkun Shin from Sungkyunkwan University in
Korea, in the paper "iJournaling: Fine-Grained Journaling for
Improving the Latency of Fsync System Call"[1], presented at the
USENIX Annual Technical Conference in 2017.

[1] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 11/13] ext4: add support for asynchronous fast commits
  2019-10-01  7:41 ` [PATCH v3 11/13] ext4: add support for asynchronous fast commits Harshad Shirwadkar
@ 2019-10-25  6:28   ` Xiaoguang Wang
  0 siblings, 0 replies; 36+ messages in thread
From: Xiaoguang Wang @ 2019-10-25  6:28 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4

hi,

> Until this patch, fast commits could only be invoked by jbd2 thread.
> This patch allows file system to perform fast commit in an async manner
> without involving jbd2 thread. This makes fast commits even faster as
> it gets rid of the time spent in context switching to jbd2 thread. In
> order to avoid race between jbd2 thread and async fast commits, we add
> new jbd2 APIs that allow file systems to indicate their intent of
> performing an async fast commit.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>   fs/ext4/ext4.h        |  3 ++
>   fs/ext4/ext4_jbd2.c   | 74 +++++++++++++++++++++++++++++++++++++++++++
>   fs/ext4/fsync.c       |  7 ++--
>   fs/jbd2/commit.c      | 11 +++++++
>   fs/jbd2/journal.c     | 59 ++++++++++++++++++++++++++++++++++
>   fs/jbd2/transaction.c |  2 ++
>   include/linux/jbd2.h  | 10 ++++++
>   7 files changed, 164 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index cd5b567d8ca8..a8a481c5ffa4 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2716,6 +2716,9 @@ extern int ext4_group_extend(struct super_block *sb,
>   extern int ext4_resize_fs(struct super_block *sb, ext4_fsblk_t n_blocks_count);
>   
>   /* super.c */
> +int ext4_fc_async_commit(journal_t *journal, tid_t commit_tid,
> +			 tid_t commit_subtid, struct inode *inode,
> +			 struct dentry *dentry);
>   extern struct buffer_head *ext4_sb_bread(struct super_block *sb,
>   					 sector_t block, int op_flags);
>   extern int ext4_seq_options_show(struct seq_file *seq, void *offset);
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 12d6e70bf676..cf796268322b 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -1144,6 +1144,80 @@ static int ext4_journal_fc_replay_cb(journal_t *journal, struct buffer_head *bh,
>   	return ret;
>   }
>   
> +int ext4_fc_async_commit(journal_t *journal, tid_t commit_tid,
> +			 tid_t commit_subtid, struct inode *inode,
> +			 struct dentry *dentry)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct super_block *sb = inode->i_sb;
> +	struct buffer_head *bh;
> +	int ret;
> +
> +	if (!ext4_should_fast_commit(sb))
> +		return jbd2_complete_transaction(journal, commit_tid);
> +
> +	read_lock(&ei->i_fc.fc_lock);
> +	if (ei->i_fc.fc_tid != commit_tid) {
> +		read_unlock(&ei->i_fc.fc_lock);
> +		return 0;
> +	}
> +	read_unlock(&ei->i_fc.fc_lock);
> +
> +	if (ext4_is_inode_fc_ineligible(inode))
> +		return jbd2_complete_transaction(journal, commit_tid);
> +
> +	if (jbd2_commit_check(journal, commit_tid, commit_subtid))
> +		return 0;
> +
> +	ret = jbd2_start_async_fc(journal, commit_tid);
> +	if (ret)
> +		return jbd2_fc_complete_commit(journal, commit_tid,
> +					       commit_subtid);
> +
> +	trace_ext4_journal_fc_commit_cb_start(sb);
> +
> +	ret = jbd2_submit_inode_data(journal, ei->jinode);
> +	if (ret)
> +		goto out;
> +
> +	ret = jbd2_map_fc_buf(journal, &bh);
> +	if (ret) {
> +		jbd2_stop_async_fc(journal, commit_tid);
> +		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "map_fc_buf");
> +		return jbd2_complete_transaction(journal, commit_tid);
> +
> +	}
> +
> +	ret = ext4_fc_write_inode(journal, bh, inode, commit_tid,
> +				  commit_subtid, 1, dentry);
> +
> +	if (ret < 0) {
> +		brelse(bh);
> +		jbd2_stop_async_fc(journal, commit_tid);
> +		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "fc_write_inode");
> +		return jbd2_complete_transaction(journal, commit_tid);
> +	}
> +	lock_buffer(bh);
> +	clear_buffer_dirty(bh);
> +	set_buffer_uptodate(bh);
> +	bh->b_end_io = ext4_end_buffer_io_sync;
> +	submit_bh(REQ_OP_WRITE, REQ_SYNC, bh);
> +
> +	jbd2_stop_async_fc(journal, commit_tid);
> +	wait_on_buffer(bh);
> +	if (unlikely(!buffer_uptodate(bh))) {
> +		trace_ext4_journal_fc_commit_cb_stop(sb, 0, "IO");
> +		return -EIO;
> +	}
> +
> +out:
> +	trace_ext4_journal_fc_commit_cb_stop(sb,
> +					     ret < 0 ? 0 : ret,
> +					     ret >= 0 ? "success" : "fail");
> +	wake_up(&journal->j_wait_async_fc);
> +	return ret;
> +}
> +
>   void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
>   {
>   	if (ext4_should_fast_commit(sb)) {
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 5508baa11bb6..5bbfc55e1756 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -98,7 +98,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   	struct ext4_inode_info *ei = EXT4_I(inode);
>   	journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
>   	int ret = 0, err;
> -	tid_t commit_tid;
> +	tid_t commit_tid, commit_subtid;
>   	bool needs_barrier = false;
>   
>   	if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> @@ -148,10 +148,13 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>   	}
>   
>   	commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> +	commit_subtid = datasync ? ei->i_datasync_subtid : ei->i_sync_subtid;
> +
>   	if (journal->j_flags & JBD2_BARRIER &&
>   	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
>   		needs_barrier = true;
> -	ret = jbd2_complete_transaction(journal, commit_tid);
> +	ret = ext4_fc_async_commit(journal, commit_tid, commit_subtid,
> +				   inode, file->f_path.dentry);
>   	if (needs_barrier) {
>   	issue_flush:
>   		err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index e85f51e1cc70..18cb70fa2421 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -452,6 +452,17 @@ void jbd2_journal_commit_transaction(journal_t *journal, bool *fc)
>   
>   	write_lock(&journal->j_state_lock);
>   	full_commit = journal->j_do_full_commit;
> +	journal->j_running_transaction->t_async_fc_allowed = false;
> +	while (journal->j_running_transaction->t_async_fc_ongoing) {
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&journal->j_wait_async_fc, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		write_unlock(&journal->j_state_lock);
> +		schedule();
> +		write_lock(&journal->j_state_lock);
> +		finish_wait(&journal->j_wait_async_fc, &wait);
> +	}
>   	write_unlock(&journal->j_state_lock);
>   
>   	/* Let file-system try its own fast commit */
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index e0684212384d..81daa2cff67f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -794,6 +794,64 @@ int jbd2_commit_check(journal_t *journal, tid_t tid, tid_t subtid)
>   	return 0;
>   }
>   
> +int jbd2_start_async_fc(journal_t *journal, tid_t tid)
> +{
> +	transaction_t *txn;
> +	int ret = -EINVAL;
> +
> +	if (!journal->j_running_transaction)
> +		return ret;
> +
> +	if (journal->j_running_transaction->t_tid != tid)
> +		return ret;
> +
> +	txn = journal->j_running_transaction;
> +	write_lock(&journal->j_state_lock);
> +	while (txn->t_state == T_RUNNING) {
> +		DEFINE_WAIT(wait);
> +
> +		if (txn->t_async_fc_allowed) {
> +			if (!txn->t_async_fc_ongoing) {
> +				txn->t_async_fc_ongoing = true;
> +				ret = 0;
> +				break;
> +			}
> +			prepare_to_wait(&journal->j_wait_async_fc,
> +					&wait, TASK_UNINTERRUPTIBLE);
> +			write_unlock(&journal->j_state_lock);
> +			schedule();
> +			write_lock(&journal->j_state_lock);
> +			finish_wait(&journal->j_wait_async_fc, &wait);
It seems that the logic above will prevent concurrent fsync operations from
using the fast commit feature?

Regards,
Xiaoguang Wang

> +		} else {
> +			ret = -ECANCELED;
> +			break;
> +		}
> +	}
> +	write_unlock(&journal->j_state_lock);
> +
> +	return ret;
> +}
> +
> +int jbd2_stop_async_fc(journal_t *journal, tid_t tid)
> +{
> +	transaction_t *txn;
> +
> +	if (!journal->j_running_transaction)
> +		return -EINVAL;
> +
> +	if (journal->j_running_transaction->t_tid != tid)
> +		return -EINVAL;
> +
> +	txn = journal->j_running_transaction;
> +	write_lock(&journal->j_state_lock);
> +	J_ASSERT(txn->t_state == T_RUNNING);
> +	txn->t_async_fc_ongoing = false;
> +	txn->t_subtid++;
> +	write_unlock(&journal->j_state_lock);
> +	return 0;
> +
> +}
> +
>   /* Return 1 when transaction with given tid has already committed. */
>   int jbd2_transaction_committed(journal_t *journal, tid_t tid)
>   {
> @@ -1308,6 +1366,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
>   	init_waitqueue_head(&journal->j_wait_commit);
>   	init_waitqueue_head(&journal->j_wait_updates);
>   	init_waitqueue_head(&journal->j_wait_reserved);
> +	init_waitqueue_head(&journal->j_wait_async_fc);
>   	mutex_init(&journal->j_barrier);
>   	mutex_init(&journal->j_checkpoint_mutex);
>   	spin_lock_init(&journal->j_revoke_lock);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index ce7f03cfd90b..f17f813b5610 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -103,6 +103,8 @@ static void jbd2_get_transaction(journal_t *journal,
>   	transaction->t_max_wait = 0;
>   	transaction->t_start = jiffies;
>   	transaction->t_requested = 0;
> +	transaction->t_async_fc_allowed = true;
> +	transaction->t_async_fc_ongoing = false;
>   }
>   
>   /*
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 312103fc9581..5610f16de919 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -604,6 +604,7 @@ struct transaction_s
>   		T_FINISHED
>   	}			t_state;
>   
> +	bool t_async_fc_allowed, t_async_fc_ongoing;
>   	/*
>   	 * Where in the log does this transaction's commit start? [no locking]
>   	 */
> @@ -869,6 +870,13 @@ struct journal_s
>   	 */
>   	wait_queue_head_t	j_wait_reserved;
>   
> +	/**
> +	 * @j_wait_async_fc:
> +	 *
> +	 * Wait queue to wait for completion of async fast commits.
> +	 */
> +	wait_queue_head_t	j_wait_async_fc;
> +
>   	/**
>   	 * @j_checkpoint_mutex:
>   	 *
> @@ -1594,6 +1602,8 @@ int jbd2_complete_transaction(journal_t *journal, tid_t tid);
>   int jbd2_log_do_checkpoint(journal_t *journal);
>   int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
>   int jbd2_fc_complete_commit(journal_t *journal, tid_t tid, tid_t subtid);
> +int jbd2_start_async_fc(journal_t *journal, tid_t tid);
> +int jbd2_stop_async_fc(journal_t *journal, tid_t tid);
>   
>   void __jbd2_log_wait_for_space(journal_t *journal);
>   extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
> 
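
To make the question about jbd2_start_async_fc() concrete, here is a minimal
userspace sketch of the two per-transaction flags from the patch. It is only a
model under stated assumptions: struct fc_txn_model, fc_model_try_start() and
fc_model_stop() are hypothetical names, and where the kernel code sleeps on
j_wait_async_fc the model returns -EBUSY instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model of the per-transaction fast commit state. */
struct fc_txn_model {
	bool async_fc_allowed;	/* cleared once a full commit has started */
	bool async_fc_ongoing;	/* set while one async fast commit runs */
	unsigned int subtid;	/* bumped after every async fast commit */
};

/*
 * Non-blocking stand-in for jbd2_start_async_fc(). The patch would put the
 * caller to sleep where this model returns -EBUSY.
 */
int fc_model_try_start(struct fc_txn_model *txn)
{
	if (!txn->async_fc_allowed)
		return -ECANCELED;
	if (txn->async_fc_ongoing)
		return -EBUSY;
	txn->async_fc_ongoing = true;
	return 0;
}

/* Stand-in for jbd2_stop_async_fc(): release the gate, advance the subtid. */
void fc_model_stop(struct fc_txn_model *txn)
{
	assert(txn->async_fc_ongoing);	/* model-only precondition */
	txn->async_fc_ongoing = false;
	txn->subtid++;
}
```

In this model a second fc_model_try_start() on the same transaction cannot
succeed until fc_model_stop() runs, which is the serialization the question
is about.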

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 08/13] ext4: fast-commit commit range tracking
  2019-10-16 21:36   ` Theodore Y. Ts'o
@ 2019-10-30  5:12     ` harshad shirwadkar
  0 siblings, 0 replies; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-30  5:12 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

Thanks for this, I'll remove these calls and add calls in ext4_map_blocks.

On Wed, Oct 16, 2019 at 2:36 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Oct 01, 2019 at 12:40:57AM -0700, Harshad Shirwadkar wrote:
> > With this patch, we track logical range of file offsets that need to
> > be committed using fast commit. This allows us to find file extents
> > that need to be committed during the commit time.
>
> We don't actually need to track when data is modified in the page
> cache, which is what this commit is actually doing.  We only need to
> track newly allocated blocks, at granularity of the logical block
> number.
>
> That's because we only need to force out newly allocated blocks to
> make sure we don't reveal stale data when we are in data=ordered mode.
> And it also follows that we don't need to track logical block ranges
> and submit inode data in data=writeback or data=journalled mode.
>
> In the case where the user has actually called fsync() on the
> inode, we do a data integrity writeback in ext4_sync_file, and that's
> independent of the fast commit code.
>
> But if the file is being modified using buffered writes, or if an
> already allocated block is changed, and the file has *not* been
> changed, we don't need to write out those blocks on a fast commit.
> For example, in the case where the fast commit is being
> initiated via ext4_nfs_commit_metadata() -> ext4_write_inode(), we
> only care about submitting data for the newly allocated blocks.  And
> that's what we want to track here.
>
> Hence, all of the callers of ext4_fc_update_commit_range() here are in
> the wrong place.  (Also, they are calling ext4_fc_update_commit_range
> with byte offsets, when the function is expecting logical block
Thanks for pointing that out. My code as of now works with logical
file offsets instead of logical block offsets. So I should have used
file offset type instead of logical block type for arguments of
ext4_fc_update_commit_range. But it makes sense to just use logical
block offsets everywhere. I'll fix this in the next version.

> numbers, but that doesn't really matter, since the existing call sites
> all need to be removed and replaced with new ones in ext4_map_blocks().)
>
>                                      - Ted
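
A rough sketch of what tracking at logical block granularity could look like,
as opposed to passing byte offsets. This is not the code from the series:
struct fc_lblk_range, fc_track_alloc() and fc_byte_to_lblk() are hypothetical
names, and the real hook would sit in ext4_map_blocks() as suggested above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-inode record of the logical block range to fast commit. */
struct fc_lblk_range {
	uint32_t start;	/* first logical block, inclusive */
	uint32_t end;	/* last logical block, inclusive */
	int valid;	/* 0 until the first allocation is recorded */
};

/* Widen the range to cover newly allocated blocks [lblk, lblk + len - 1]. */
void fc_track_alloc(struct fc_lblk_range *r, uint32_t lblk, uint32_t len)
{
	uint32_t last;

	if (len == 0)
		return;
	last = lblk + len - 1;
	if (!r->valid) {
		r->start = lblk;
		r->end = last;
		r->valid = 1;
		return;
	}
	if (lblk < r->start)
		r->start = lblk;
	if (last > r->end)
		r->end = last;
}

/* Convert a byte offset to a logical block number (block size 1 << blkbits). */
uint32_t fc_byte_to_lblk(uint64_t pos, unsigned int blkbits)
{
	assert(blkbits < 64);
	return (uint32_t)(pos >> blkbits);
}
```

With 4 KiB blocks (blkbits = 12), byte offset 8192 maps to logical block 2,
so a caller that only has a byte offset would convert once at the boundary
and use block numbers everywhere else.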

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-18  1:56   ` Theodore Y. Ts'o
  2019-10-18  4:51     ` Andreas Dilger
@ 2019-10-31  5:34     ` harshad shirwadkar
  2019-10-31  6:41       ` harshad shirwadkar
  1 sibling, 1 reply; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-31  5:34 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

Thanks, good point. I was trying to imitate how a jbd2 commit works, I guess.
There's really no reason to do this atomically. I'll fix this in the
next version.

On Thu, Oct 17, 2019 at 6:56 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Tue, Oct 01, 2019 at 12:41:01AM -0700, Harshad Shirwadkar wrote:
> > +
> > +Multiple fast commit blocks are a part of one sub-transaction. To
> > +indicate the last block in a fast commit transaction, fc_flags field
> > +in the last block in every subtransaction is marked with "LAST" (0x1)
> > +flag. A subtransaction is valid only if all the following conditions
> > +are met:
> > +
> > +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> > +   the previous fast commit block.
> > +2) For every sub-transaction, last block is marked with LAST flag.
> > +3) There are no invalid blocks in between.
>
> I'm wondering why we need to support multiple inodes being modified in
> a single transaction.  As we currently have defined what can be done,
> all updates to an inode should be free standing and not dependent on a
> change to another inode, right?  And today, one block only modifies
> one inode.
>
> The only reason why we might want to define a sub-transaction as being
> composed of multiple inodes, which must all be updated in an
> all-or-nothing fashion, is the swap boot inode ioctl, and if that's
> the only one, I wonder if it's worth the extra complexity.
>
> Am I missing anything?
>
>                                         - Ted
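
For reference, the three validity conditions quoted above can be sketched as
a scan over the fast commit blocks. This is only an illustration of the rules
as written in the v3 documentation, not the recovery code from the series:
struct fc_block_model, FC_FLAG_LAST as a C constant, and fc_valid_prefix()
are hypothetical, and the sketch assumes a subtransaction is a run of blocks
sharing one SUBTID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FC_FLAG_LAST 0x1	/* "LAST" flag value from the documentation */

/* Hypothetical in-memory view of one fast commit block. */
struct fc_block_model {
	uint32_t subtid;
	uint32_t flags;
	bool valid;	/* e.g. magic and checksum verified */
};

/*
 * Return how many leading blocks form complete, valid subtransactions.
 * Blocks after the returned count would not be replayed.
 */
size_t fc_valid_prefix(const struct fc_block_model *blk, size_t n)
{
	size_t i, good = 0;

	assert(blk != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Condition 3: no invalid blocks in between. */
		if (!blk[i].valid)
			break;
		/* Condition 1: SUBTID never decreases. */
		if (i > 0 && blk[i].subtid < blk[i - 1].subtid)
			break;
		/* Condition 2: a subtransaction must end with LAST. */
		if (i > 0 && blk[i].subtid != blk[i - 1].subtid &&
		    !(blk[i - 1].flags & FC_FLAG_LAST))
			break;
		if (blk[i].flags & FC_FLAG_LAST)
			good = i + 1;
	}
	return good;
}
```

If every subtransaction covered a single inode in a single block, as
suggested above, each block would carry LAST and this scan would reduce to
checking each block on its own.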

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-31  5:34     ` harshad shirwadkar
@ 2019-10-31  6:41       ` harshad shirwadkar
  0 siblings, 0 replies; 36+ messages in thread
From: harshad shirwadkar @ 2019-10-31  6:41 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

Also, at a high level, I realized that in order to allow fast commits
to be invoked from the kjournald thread, the whole patch set has become
more complicated than it needs to be. In other words, if we only
support "asynchronous fast commits" in this patch set and worry about
integrating them with the kjournald thread later, we can simplify this
series a great deal and still retain almost all of the functionality.
Besides, adding fast commit support to the kjournald thread later would
just be an in-memory change. So, to summarize:

1) fsync() will result in only the inode in question being fast
   committed, asynchronously.
2) ext4_nfs_commit_metadata() will result in all the changed inodes
   being fast committed, also asynchronously.
3) We could very well use fast commits for normal jbd2 periodic commits
   too, but it's not clear that this would add any value, so we'll
   leave it out of this patch series.

Do you agree with this?

On Wed, Oct 30, 2019 at 10:34 PM harshad shirwadkar
<harshadshirwadkar@gmail.com> wrote:
>
> Thanks, good point. I was trying to imitate how a jbd2 commit works, I guess.
> There's really no reason to do this atomically. I'll fix this in the
> next version.
>
> On Thu, Oct 17, 2019 at 6:56 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> >
> > On Tue, Oct 01, 2019 at 12:41:01AM -0700, Harshad Shirwadkar wrote:
> > > +
> > > +Multiple fast commit blocks are a part of one sub-transaction. To
> > > +indicate the last block in a fast commit transaction, fc_flags field
> > > +in the last block in every subtransaction is marked with "LAST" (0x1)
> > > +flag. A subtransaction is valid only if all the following conditions
> > > +are met:
> > > +
> > > +1) SUBTID of all blocks is either equal to or greater than SUBTID of
> > > +   the previous fast commit block.
> > > +2) For every sub-transaction, last block is marked with LAST flag.
> > > +3) There are no invalid blocks in between.
> >
> > I'm wondering why we need to support multiple inodes being modified in
> > a single transaction.  As we currently have defined what can be done,
> > all updates to an inode should be free standing and not dependent on a
> > change to another inode, right?  And today, one block only modifies
> > one inode.
> >
> > The only reason why we might want to define a sub-transaction as being
> > composed of multiple inodes, which must all be updated in an
> > all-or-nothing fashion, is the swap boot inode ioctl, and if that's
> > the only one, I wonder if it's worth the extra complexity.
> >
> > Am I missing anything?
> >
> >                                         - Ted

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v3 12/13] docs: Add fast commit documentation
  2019-10-18 13:28       ` Theodore Y. Ts'o
@ 2019-10-31 18:53         ` Andreas Dilger
  0 siblings, 0 replies; 36+ messages in thread
From: Andreas Dilger @ 2019-10-31 18:53 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Harshad Shirwadkar, linux-ext4


On Oct 18, 2019, at 7:28 AM, Theodore Y. Ts'o <tytso@MIT.EDU> wrote:
> 
> On Fri, Oct 18, 2019 at 01:51:56PM +0900, Andreas Dilger wrote:
>> What about rename or hard link?
> 
> Neither is currently handled by the fast commit patches, but each
> operation can fit inside a single block, so it could be handled as an
> update to a single inode.  In the case of rename, we will need to add
> some tags to indicate the destination directory and directory entry
> name, and whether or not there is a destination inode which needs to
> have its refcount dropped and possibly be deleted.
> 
> Harshad, we probably should handle them, since in order to support
> NFS, the nfs server will send the rename or hard link request,
> followed by a commit metadata request, and that commit metadata
> request needs to persist the rename or link.  So for the purposes of
> accelerating NFS, we should handle these commands.
> 
> If we don't handle these commands, we will need to declare the inode
> as fast commit ineligible, so that we force a full journal commit when
> the commit metadata request is received.

As a simplifying assumption, could you limit rename/link support to
the single-directory case?  That handles the common case of "create
temp file, write contents there, sync, rename over original file"
used by most editors, rsync, etc.  The case of cross-directory rename
is much less common in my experience, so it is less important to
optimize that case (if this makes it easier to add to fast commits).
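
As a rough sketch of that restriction (an illustration only, with
hypothetical names fc_commit_kind and fc_rename_commit_kind(), not anything
from the patches), the eligibility decision could be as simple as comparing
the two parent directory inode numbers:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of the eligibility check. */
enum fc_commit_kind {
	FC_COMMIT_FAST,	/* rename can be logged as a fast commit */
	FC_COMMIT_FULL,	/* fall back to a full journal commit */
};

/* Same-directory renames take the fast path; everything else falls back. */
enum fc_commit_kind fc_rename_commit_kind(uint64_t old_dir_ino,
					  uint64_t new_dir_ino)
{
	/* Inode number 0 is never a valid ext4 inode. */
	assert(old_dir_ino != 0 && new_dir_ino != 0);
	return old_dir_ino == new_dir_ino ? FC_COMMIT_FAST : FC_COMMIT_FULL;
}
```

A cross-directory rename would then mark the inode as fast commit
ineligible, which matches the fallback described above.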

Cheers, Andreas

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2019-10-31 18:53 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-01  7:40 [PATCH v3 00/13] ext4: add fast commit support Harshad Shirwadkar
2019-10-01  7:40 ` [PATCH v3 01/13] ext4: add handling for extended mount options Harshad Shirwadkar
2019-10-16  2:14   ` Theodore Y. Ts'o
2019-10-21 20:41     ` harshad shirwadkar
2019-10-01  7:40 ` [PATCH v3 02/13] jbd2: fast commit setup and enable Harshad Shirwadkar
2019-10-16 13:03   ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 03/13] jbd2: fast-commit commit path changes Harshad Shirwadkar
2019-10-16 16:38   ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 04/13] jbd2: fast-commit commit path new APIs Harshad Shirwadkar
2019-10-16 17:20   ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 05/13] jbd2: fast-commit recovery path changes Harshad Shirwadkar
2019-10-16 17:30   ` Theodore Y. Ts'o
2019-10-22  0:51     ` harshad shirwadkar
2019-10-01  7:40 ` [PATCH v3 06/13] ext4: add fields that are needed to track changed files Harshad Shirwadkar
2019-10-16 18:26   ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 07/13] ext4: track changed files for fast commit Harshad Shirwadkar
2019-10-16 20:26   ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 08/13] ext4: fast-commit commit range tracking Harshad Shirwadkar
2019-10-16 21:36   ` Theodore Y. Ts'o
2019-10-30  5:12     ` harshad shirwadkar
2019-10-01  7:40 ` [PATCH v3 09/13] ext4: fast-commit commit path changes Harshad Shirwadkar
2019-10-16 22:45   ` Theodore Y. Ts'o
     [not found]     ` <CAAJeciXQiE022GqcsTr35jSqjA6eH+zBS2KNvDPj5PovButdYA@mail.gmail.com>
2019-10-23 12:44       ` Theodore Y. Ts'o
2019-10-01  7:40 ` [PATCH v3 10/13] ext4: fast-commit recovery " Harshad Shirwadkar
2019-10-18  2:07   ` Theodore Y. Ts'o
2019-10-01  7:41 ` [PATCH v3 11/13] ext4: add support for asynchronous fast commits Harshad Shirwadkar
2019-10-25  6:28   ` Xiaoguang Wang
2019-10-01  7:41 ` [PATCH v3 12/13] docs: Add fast commit documentation Harshad Shirwadkar
2019-10-18  1:56   ` Theodore Y. Ts'o
2019-10-18  4:51     ` Andreas Dilger
2019-10-18 13:28       ` Theodore Y. Ts'o
2019-10-31 18:53         ` Andreas Dilger
2019-10-31  5:34     ` harshad shirwadkar
2019-10-31  6:41       ` harshad shirwadkar
2019-10-04 19:12 ` [PATCH v3 00/13] ext4: add fast commit support Theodore Y. Ts'o
2019-10-04 20:11   ` harshad shirwadkar
